An Epistemic Approach to the Formal Specification of Statistical Machine Learning

We propose an epistemic approach to formalizing statistical properties of machine learning. Specifically, we introduce a formal model for supervised learning based on a Kripke model where each possible world corresponds to a possible dataset and modal operators are interpreted as transformation and testing on datasets. Then we formalize various notions of the classification performance, robustness, and fairness of statistical classifiers by using our extension of statistical epistemic logic (StatEL). In this formalization, we show relationships among properties of classifiers, and relevance between classification performance and robustness. As far as we know, this is the first work that uses epistemic models and logical formulas to express statistical properties of machine learning, and would be a starting point to develop theories of formal specification of machine learning.


Introduction
With the increasing use of machine learning in real-life applications, the safety and security of learning-based systems have been of great interest. In particular, many recent studies [40], [12] have found vulnerabilities in the robustness of deep neural networks (DNNs) against malicious inputs, which can lead to disasters in security-critical systems, such as self-driving cars. To find such vulnerabilities in advance, there has been research on formal verification and testing methods for the robustness of DNNs in recent years [23,26,35,41]. However, relatively little attention has been paid to the formal specification of machine learning [38].
In the research field of formal specification and verification, logical approaches have proved useful for characterizing desired properties and for developing theories to discuss those properties. For example, temporal logic [36] is a branch of modal logic for expressing time-dependent propositions, and has been widely used to describe requirements of hardware and software systems. For another example, epistemic logic [44] is a modal logic of knowledge and belief that has been employed as a formal policy language for distributed systems (e.g., for the authentication [8] and the anonymity [39] of security protocols). As far as we know, however, no prior work has employed logical formulas to rigorously describe various statistical properties of machine learning, although some papers (often informally) list desirable properties of machine learning [38].
In this paper, we present a first logical formalization of statistical properties of machine learning. To describe these statistical properties in a simple and abstract way, we extend statistical epistemic logic (StatEL) [27], which has recently been proposed to describe statistical knowledge and has been applied to formalize statistical hypothesis testing and the statistical privacy of databases.
A key idea in our modeling of statistical machine learning is that we formalize logical aspects at the syntax level, and statistical distances and dataset operations at the semantics level, by using accessibility relations of a Kripke model [30]. In this model, we formalize supervised learning and some of its desirable properties, including performance, robustness, and fairness. More specifically, classification performance and robustness are described in terms of the differences between the correct class label and the classifier's prediction, whereas fairness is expressed as a conditional indistinguishability between different groups.
Our contributions. The main contributions of this work are as follows:
- We propose a logical approach to formalizing statistical properties of machine learning in a simple and abstract way. Specifically, we introduce a principle that logical aspects of statistical properties are described at the syntax level, while statistical distances and datasets are formalized at the semantics level.
- We formalize supervised learning models and test datasets (used to check whether the learning models satisfy a specification) by employing a distributional Kripke model [27] where each possible world corresponds to a possible test dataset, and modal operators are interpreted as transformation and testing on datasets. Then we show how the sampling from a dataset and non-deterministic adversarial inputs are formalized in the distributional Kripke model.
- We propose an extension of statistical epistemic logic (StatEL) as a formal language to describe various properties of machine learning models, including the performance, robustness, and fairness of statistical classifiers. Then the satisfaction of logical formulas representing those properties is associated with their testing using a test dataset. As far as we know, this is the first work that uses logical formulas to formalize various statistical properties of machine learning, and that provides an epistemic view on those properties.
- We show some relationships among properties of classifiers, such as different levels of robustness. We also present certain relationships between classification performance and robustness, which suggest robustness-related properties that have not been formalized in the literature as far as we know.
Cautions and limitations. In this paper, we focus on formalizing properties of supervised learning models that may be tested by using a dataset; i.e., we do not deal with unsupervised learning, reinforcement learning, the properties of learning algorithms, the quality of training data (e.g., sample bias), the quality of testing (e.g., coverage criteria), explainability, temporal properties, or system-level specification. It should be noted that most of the properties formalized in this paper are known in the machine learning literature, and the novelty of this work lies in the logical formulation of those statistical properties. We also highlight that this work aims to provide a logical approach to the modeling of statistical properties tested with a dataset, and does not present methods for checking, guaranteeing, or improving the performance/robustness/fairness of machine learning models. As for the satisfiability of logical formulas, we leave the development of testing and (statistical) model checking algorithms as future work, since the research area of testing and verification of machine learning is relatively new and needs further techniques to improve scalability. Moreover, in some applications such as image recognition, some atomic formulas (e.g., representing whether an input image is a panda) cannot be defined mathematically, and require additional techniques based on experiments. Nevertheless, we demonstrate that describing various properties using logical formulas is useful for exploring desirable properties and for discussing their relationships within a single framework.
Finally, we emphasize that our work is the first attempt to use epistemic models and logical formulas to express statistical properties of machine learning models, and would be a starting point to develop theories of formal specification of machine learning in future research.
Relationship with the preliminary version. The main novelties of this paper with respect to the preliminary version [28] are as follows:
- We add how the satisfaction of a formula at a possible world can be regarded as the testing of a specification using a test dataset (Sect. 3.1).
- We show how modal operators are used to model the transformation and testing on datasets. For example, data preparation T (e.g., data cleaning, data augmentation) can also be formalized as a modal operator ∆_T (Sect. 3.2).
- We re-interpret the non-classical implication ⊃ for conditional probabilities in StatEL as a modal operator associated with a conditioning relation (Sect. 3.3).
- We introduce a modal operator ∼_x^{ε,D} for conditional indistinguishability (Sect. 3.4). We then provide a more comprehensible formalization of the fairness of supervised learning (Sect. 7) without using counterfactual epistemic operators [28], because the formalization using those operators requires an additional formula and makes the presentation more complicated and unintuitive.
- We add a formalization of generalization error to capture how accurately a classifier is able to classify previously unseen input data (Sect. 5.3).
- We add formalizations of other fairness notions, called separation (Sect. 7.3) and sufficiency (Sect. 7.4), so that this paper covers all three categories of fairness notions [5].
- We show a running example of pedestrian detection to illustrate the formalization of various notions of performance, robustness, and fairness.
Paper organization. The rest of this paper is organized as follows. Sect. 2 presents notations used in this paper and provides background on statistical distances and statistical epistemic logic (StatEL). Sect. 3 introduces a different view on the modal operators in StatEL and extends the logic with additional operators. Sect. 4 introduces a formal model for describing the behaviors of statistical classifiers and non-deterministic adversarial inputs. Sects. 5, 6, and 7 respectively formalize various notions of the performance, robustness, and fairness of classifiers by using our extension of StatEL. Sect. 8 presents related work and Sect. 9 concludes.

Preliminaries
In this section we introduce some notations, and review background on statistical distance notions and the syntax and semantics of statistical epistemic logic (StatEL), introduced in [27].

Notations
Let R≥0 be the set of non-negative real numbers, and [0, 1] be the set of non-negative real numbers not greater than 1. We denote by DO the set of all probability distributions over a finite set O. Given a finite set O and a probability distribution µ ∈ DO, the probability of sampling a value v from µ is denoted by µ[v].

Statistical Distance
We recall popular notions of distance between probability distributions: total variation and ∞-Wasserstein distance.
Informally, total variation between two distributions µ 0 and µ 1 over a set O represents the largest difference between the probabilities that µ 0 and µ 1 assign to an identical subset R of O.
Definition 1 (Total variation) For a finite set O, the total variation D_tv of two distributions µ0, µ1 ∈ DO is defined by:

D_tv(µ0, µ1) = max_{R ⊆ O} | µ0(R) − µ1(R) |,

where µ_i(R) = Σ_{v ∈ R} µ_i[v].

We then recall the ∞-Wasserstein metric [43]. Intuitively, the ∞-Wasserstein metric W_∞(µ0, µ1) between two distributions µ0, µ1 represents the minimum largest move between points in a transportation from µ0 to µ1.

Definition 2 (∞-Wasserstein metric) For a metric d over a finite set O, the ∞-Wasserstein metric W_∞ of two distributions µ0, µ1 ∈ DO is defined by:

W_∞(µ0, µ1) = min_{µ ∈ cp(µ0, µ1)} max_{(v0, v1) ∈ supp(µ)} d(v0, v1),

where cp(µ0, µ1) is the set of all couplings¹ of µ0 and µ1.
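As a concrete illustration, for finite distributions the maximum over subsets R in Definition 1 is attained by collecting the points where µ0 exceeds µ1, so D_tv equals half the L1 distance between the probability vectors. A minimal sketch, with distributions represented as plain dicts (a notational assumption of ours, not the paper's):

```python
def total_variation(mu0, mu1):
    """D_tv(mu0, mu1) = max_R |mu0(R) - mu1(R)|.
    For finite distributions this equals half the L1 distance."""
    support = set(mu0) | set(mu1)
    return 0.5 * sum(abs(mu0.get(v, 0.0) - mu1.get(v, 0.0)) for v in support)

mu0 = {"a": 0.5, "b": 0.5}
mu1 = {"a": 0.8, "b": 0.2}
print(total_variation(mu0, mu1))  # ≈ 0.3, attained by R = {"a"}
```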

Syntax of StatEL
We next recall the syntax of statistical epistemic logic (StatEL) [27], which has two levels of formulas: static and epistemic formulas. Intuitively, a static formula describes a proposition satisfied at a (deterministic) state, while an epistemic formula describes a proposition satisfied at a probability distribution of states. In this paper, the former is used only to define the latter. Formally, let Mes be a set of symbols called measurement variables, and Γ be a set of atomic formulas of the form γ(x1, x2, ..., xn) for a predicate symbol γ, n ≥ 0, and x1, x2, ..., xn ∈ Mes. Let I ⊆ [0, 1] be a finite union of disjoint intervals, and A be a finite set of indices (e.g., associated with statistical divergences). Then the formulas are defined by:

Static formulas: ψ ::= γ(x1, x2, ..., xn) | ¬ψ | ψ ∧ ψ
Epistemic formulas: ϕ ::= P_I ψ | ψ ⊃ P_I ψ | ¬ϕ | ϕ ∧ ϕ | K_a ϕ

where a ∈ A. We denote by F the set of all epistemic formulas. Note that we have no quantifiers over measurement variables. (See Sect. 2.5 for more details.) The probability quantification P_I ψ represents that a static formula ψ is satisfied with a probability belonging to a set I. For instance, P_(0.95,1] ψ represents that ψ holds with a probability greater than 0.95. By ψ0 ⊃ P_I ψ1 we represent that the conditional probability of ψ1 given ψ0 is included in a set I. The epistemic knowledge K_a ϕ expresses that we know ϕ when our capability of observation is denoted by a ∈ A.

¹ A coupling of two distributions µ0, µ1 ∈ DO is a joint distribution µ ∈ D(O × O) such that µ0 and µ1 are µ's marginal distributions, i.e., µ0[v0] = Σ_{v1 ∈ O} µ[(v0, v1)] and µ1[v1] = Σ_{v0 ∈ O} µ[(v0, v1)] for all v0, v1 ∈ O. For a coupling µ, the support supp(µ) is the maximum subset of O × O whose elements are assigned non-zero probabilities in µ.

Distributional Kripke Model
Next we recall the notion of a distributional Kripke model [27], where each possible world is associated with a probability distribution over a set of states, and with a stochastic assignment of data to measurement variables.

Definition 3 (Distributional Kripke model) Let A be a finite set of indices (typically associated with operations and tests on datasets), S be a finite set of states, and O be a finite set of data, called a data domain. A distributional Kripke model is a tuple M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}) consisting of:
- a non-empty set W of multisets of states belonging to S;
- for each a ∈ A, an accessibility relation R_a ⊆ W × W;
- for each s ∈ S, a valuation V_s : Γ → P(O^k) that maps each k-ary predicate γ to a set V_s(γ) of k-tuples of data.
The set W is called a universe, and its elements are called possible worlds. A world is said to be finite if it is a finite multiset, i.e., it has a finite number of (possibly duplicated) elements. A world is said to be infinite if it is an infinite multiset.
The relation R_a determines an accessibility between two worlds. For example, (w, w′) ∈ R_a means that a world w′ is accessible from a world w when our capability of distinguishing possible worlds is denoted by a ∈ A. The valuation V_s may give a possibly different interpretation of a predicate γ at a different state s. We assume that all measurement variables range over the same data domain O in every world. The interpretation of measurement variables at a state s is given by a deterministic assignment σ_s defined below.
Definition 4 (Deterministic assignment) For any distributional Kripke model M =(W, (R a ) a∈A , (V s ) s∈S ), we assume that each world w ∈ W is associated with a function ρ w : Mes × S → O that maps each measurement variable x to its value ρ w (x, s) that is observed at a state s belonging to the world w. We also assume that each state s in a world w is associated with the deterministic assignment σ s : Mes → O defined by σ s (x) = ρ w (x, s).
Since each world w is a multiset of states, we abuse the notation and denote by w[s] the probability that a state s is randomly chosen from w (i.e., the number of occurrences of s in the multiset w, divided by the total number of elements in w). Here we regard each world w as a probability distribution over the states that corresponds to the multiset.
The probability that a measurement variable x ∈ Mes has a value v ∈ O is given by:

σ_w(x)[v] = Σ_{s ∈ S: σ_s(x) = v} w[s].

Note that σ_w : Mes → DO maps each measurement variable x to a probability distribution σ_w(x) over the data domain O. Hence σ_w represents the joint probability distribution of all variables in Mes, and is called the stochastic assignment at w. When a state s is uniformly drawn from a multiset w of states, a datum σ_s(x) is sampled from the distribution σ_w(x).
In later sections, a possible world corresponds to a dataset (i.e., a multiset of data tuples) from which data are sampled. For example, suppose that we have only three measurement variables Mes = {x, y, z}. Then for each state s in a world w, the deterministic assignment σ s : Mes → O represents the tuple of data (σ s (x), σ s (y), σ s (z)). Hence each state s corresponds to a tuple of data, and the world w corresponds to the dataset {(σ s (x), σ s (y), σ s (z)) | s ∈ w}.
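To make this correspondence concrete, a finite world can be represented as a list of states with multiplicity, each state being an assignment of the measurement variables; the stochastic assignment σ_w(x) is then the empirical marginal over the multiset. A small sketch under these representation assumptions (the variables and values are toy examples of ours):

```python
from collections import Counter
from fractions import Fraction

# A world w as a multiset of states; each state assigns values to Mes = {x, y, z}.
w = [
    {"x": 1, "y": "pos", "z": 0},
    {"x": 1, "y": "neg", "z": 0},
    {"x": 2, "y": "pos", "z": 1},
    {"x": 1, "y": "pos", "z": 0},  # duplicated state: w is a multiset
]

def sigma_w(w, var):
    """Marginal distribution sigma_w(var): maps each value v to the sum of
    w[s] over states s with sigma_s(var) = v."""
    counts = Counter(s[var] for s in w)
    n = len(w)
    return {v: Fraction(c, n) for v, c in counts.items()}

print(sigma_w(w, "x"))  # x has value 1 with probability 3/4 and 2 with probability 1/4
```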

Stochastic Semantics of StatEL
Now we recall the stochastic semantics [27] for the StatEL formulas over a distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}). The interpretation of a static formula ψ at a state s is given by:

s |= γ(x1, ..., xn) iff (σ_s(x1), ..., σ_s(xn)) ∈ V_s(γ)
s |= ¬ψ iff s |= ψ does not hold
s |= ψ0 ∧ ψ1 iff s |= ψ0 and s |= ψ1.

The restriction w|ψ of a world w to a static formula ψ is defined by:

w|ψ[s] = w[s] / ( Σ_{s′: s′ |= ψ} w[s′] ) if s |= ψ, and w|ψ[s] = 0 otherwise.

Note that w|ψ is undefined if there is no state s that satisfies ψ and has a non-zero probability in w.
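The restriction and the resulting conditional probability can be sketched directly on the multiset representation, with states encoded as dicts and static formulas as Boolean predicates (our own encoding, not the paper's):

```python
from fractions import Fraction

def restrict(w, psi):
    """w|psi: the sub-multiset of states satisfying psi (renormalization is
    implicit in the multiset view); undefined (None) if no state satisfies psi."""
    sub = [s for s in w if psi(s)]
    return sub or None

def prob(w, psi):
    """Pr[ s <- w : s |= psi ] for a finite world w."""
    return Fraction(sum(1 for s in w if psi(s)), len(w))

w = [{"x": v} for v in (0, 1, 2, 3)]
even = lambda s: s["x"] % 2 == 0
small = lambda s: s["x"] < 2
# conditional probability Pr[small | even] via the restriction w|even
print(prob(restrict(w, even), small))  # 1/2
```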
Then the interpretation of epistemic formulas in a world w is defined by:

M, w |= P_I ψ iff Pr[ s ←$ w : s |= ψ ] ∈ I
M, w |= ψ0 ⊃ P_I ψ1 iff M, w|ψ0 |= P_I ψ1
M, w |= ¬ϕ iff M, w |= ϕ does not hold
M, w |= ϕ0 ∧ ϕ1 iff M, w |= ϕ0 and M, w |= ϕ1
M, w |= K_a ϕ iff for every w′ ∈ W, (w, w′) ∈ R_a implies M, w′ |= ϕ,

where s ←$ w represents that a state s is sampled from the distribution w.
Then M, w |= ψ 0 ⊃ P I ψ 1 represents that the conditional probability of satisfying a static formula ψ 1 given another ψ 0 is included in a set I at a world w.
In each world w, measurement variables can be interpreted using σ_w. This allows us to assign different values to different occurrences of a variable in a formula. Finally, the interpretation of an epistemic formula ϕ in M is given by:

M |= ϕ iff for every world w ∈ W, M, w |= ϕ.

Hereafter we mainly focus on the satisfaction local to a possible world, and M may be omitted when it is clear from the context.

Modality as Transformation and Testing on Datasets
In this section we introduce a different view on the modal operators in statistical epistemic logic (StatEL), and define additional modal operators that are used to formalize various properties of machine learning in Sects. 5 to 7.

Checking Satisfaction at a World as Testing with a Dataset
We first show how we regard the satisfaction of a formula ϕ as testing a system's specification expressed by ϕ as follows.
As explained in Sect. 2.4, a possible world corresponds to a possible dataset. Thus, given a model M, a world w, and a formula ϕ, checking the satisfaction M, w |= ϕ can be regarded as testing whether the specification ϕ of a system (e.g., a machine learning model we formalize in Sect. 4) is satisfied when the dataset w provides inputs to the system. For example, let ϕ be a formula representing that a machine learning task (e.g., classification) C fails with probability at most 5%. Then M, w |= ϕ represents that when the learning task C is performed using a test dataset w, then it fails for at most 5% of the test data in w.
For simplicity, we discuss the satisfaction of formulas ϕ in which neither K_a nor P_a occurs. For each state (namely, data tuple) s ∈ w and for each static sub-formula ψ of ϕ, we can efficiently check whether s |= ψ.
When the dataset w is finite (i.e., it is a finite multiset of data tuples), we can check the satisfaction w |= ϕ in finite time, more precisely, in linear time in the number of elements in w.
When the dataset w is infinite, however, we cannot check whether w |= ϕ in general. For example, suppose that w is the infinite dataset representing a true distribution from which data are sampled and observed. When we cannot learn w itself, we usually obtain a finite dataset w fin by sampling data from w repeatedly and independently and check a specification ϕ only with this test dataset w fin .
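This sampling step can be sketched as follows: we draw a finite dataset w_fin i.i.d. from a (hypothetical) true distribution and evaluate the frequency with which a static formula holds, which is how a specification P_I ψ would be tested in practice. The distribution and the formula below are toy assumptions of ours:

```python
import random

random.seed(0)  # fixed seed for reproducibility

# a hypothetical true distribution over states, modeled as a sampler
def sample_true_state():
    return {"x": random.random()}

psi = lambda s: s["x"] < 0.25  # an example static formula

# draw a finite test dataset w_fin i.i.d. from the true distribution
w_fin = [sample_true_state() for _ in range(10_000)]
freq = sum(1 for s in w_fin if psi(s)) / len(w_fin)

# test the specification P_I psi with I = [0.2, 0.3] against w_fin
print(0.2 <= freq <= 0.3)
```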
Hereafter, we mainly deal with distributional Kripke models M that have infinite numbers of finite worlds. In the following sections except Sect. 6, we deal only with formulas without K_a or P_a, hence we can check their satisfaction at a finite world in finite time.

Modal Operators for Dataset Transformation
In the rest of Sect. 3, we show that modal operators can be used to model the transformation and testing on datasets.
First, we introduce modal operators for dataset transformation. The modal operator ∆_T defined below is unary (i.e., it takes a single formula as argument), and is parameterized by a transformation T between datasets. Intuitively, w |= ∆_T ϕ represents that a formula ϕ is satisfied in the dataset T(w) obtained by transforming the current dataset w by T. Formally, the modal operator ∆_T is interpreted as follows.
Definition 5 (Modality ∆_T for a dataset transformation T) Given a function T : W → W, we define an accessibility relation R_T ⊆ W × W by R_T = {(w, T(w)) | w ∈ W}. Then the interpretation of ∆_T is given by:

M, w |= ∆_T ϕ iff M, T(w) |= ϕ.

For example, machine learning often requires data preparation to manipulate a given raw dataset into a form that makes a machine learning task feasible and more effective (e.g., data cleaning, data augmentation). For a dataset w and two ways of data preparation T0 and T1, w |= ∆_T0 ϕ ∧ ∆_T1 ϕ represents that a property ϕ holds for the two prepared datasets T0(w) and T1(w).
For another example, the security of machine learning often assumes a certain malicious adversary that can manipulate a given dataset to make a machine learning task fail. Such adversarial operations T on datasets can also be formalized using a different modal operator corresponding to T as we will explain in Sect. 6.
In the next section, we show that the logical connective ⊃ can be re-interpreted as the modality ∆ T for some dataset transformation T .

Modality for Conditioning
We then present another interpretation of the logical connective ⊃ (defined in Sect. 2.5) used to express conditional probabilities in Sects. 5 and 6. Roughly speaking, we regard the restriction w| ψ of a world w to a static formula ψ as a transformation R ψ of w. Then we redefine ⊃ as a modal operator associated with R ψ , and call it the conditioning operator. Formally, the interpretation of ⊃ is defined as follows.
Definition 6 (Conditioning operator ⊃) Assume that the universe W includes all sub-multisets of each w ∈ W. Given a static formula ψ, we define an accessibility relation as the conditioning relation R_ψ = {(w, w|ψ) | w ∈ W, w|ψ is defined}. Then the interpretation of the conditioning operator ⊃ is given by:

M, w |= ψ ⊃ ϕ iff M, w|ψ |= ϕ whenever w|ψ is defined.

Intuitively, w |= ψ ⊃ ϕ corresponds to two operations: (i) transforming the given dataset w into the sub-dataset w|ψ, and (ii) testing whether a property ϕ holds for the sub-dataset w|ψ. When no data in the dataset w satisfy the property ψ, we can describe this as M, w |= ψ ⊃ ⊥ by using the propositional constant falsum ⊥.
Note that the conditioning ψ ⊃ ϕ can be regarded as the modal formula ∆ T ϕ with the dataset transformation T where T (w) = w| ψ for all w ∈ W.
In Sects. 5 and 6, we show concrete examples using the conditioning operator ⊃, i.e., the classification performance and robustness of statistical classifiers.

Modality for Conditional Indistinguishability
Next, we introduce a modal operator that is used to formalize the fairness of machine learning in Sect. 7.
Given two static formulas ψ0, ψ1 (e.g., representing male and female), σ_{w|ψ0}(x) (resp. σ_{w|ψ1}(x)) represents the probability distribution of values of a measurement variable x generated from the sub-dataset w|ψ0, e.g., the sub-dataset about male (resp. w|ψ1, e.g., about female). To formalize a certain similarity between x's values generated from the two sub-datasets (e.g., between the benefits for male and for female), we introduce a modal operator ∼_x^{ε,D} for conditional indistinguishability as follows. We write ψ0 ∼_x^{ε,D} ψ1 to represent that the two distributions σ_{w|ψ0}(x) and σ_{w|ψ1}(x) are indistinguishable up to a threshold ε in terms of a divergence or distance D. Formally, this modality is defined as follows.

Definition 7 (Conditional indistinguishability operator ∼_x^{ε,D}) Assume that the universe W includes all sub-multisets of each w ∈ W. Given an x ∈ Mes, an ε ∈ R≥0, and a divergence or distance D : DO × DO → R≥0, we define an accessibility relation R_x^{ε,D} ⊆ W × W by:

R_x^{ε,D} = { (w0, w1) ∈ W × W | D(σ_{w0}(x), σ_{w1}(x)) ≤ ε }.

Then for static formulas ψ0 and ψ1, we define the interpretation of ψ0 ∼_x^{ε,D} ψ1 by:

M, w |= ψ0 ∼_x^{ε,D} ψ1 iff there exist w0, w1 ∈ W such that (w, w0) ∈ R_ψ0, (w, w1) ∈ R_ψ1, and (w0, w1) ∈ R_x^{ε,D},

where R_ψ0 and R_ψ1 are the two conditioning relations in Definition 6.
Note that two worlds are related by R_x^{ε,D} if they have close probability distributions of the values of x. Intuitively, w |= ψ0 ∼_x^{ε,D} ψ1 corresponds to two operations: (i) transforming the given dataset w into the two sub-datasets w|ψ0 and w|ψ1, and (ii) testing whether the probability distribution of x generated by the dataset w|ψ0 is indistinguishable from the distribution generated by the dataset w|ψ1.
When ε = 0, the operator ∼_x^{0,D} represents the identity of two distributions.

Proposition 1 For a world w, static formulas ψ0, ψ1, and a measurement variable x, we have w |= ψ0 ∼_x^{0,D} ψ1 if and only if the two distributions σ_{w|ψ0}(x) and σ_{w|ψ1}(x) are identical. This proposition is immediate from the following lemma.
In Sect. 7, we present examples using the conditional indistinguishability operator, i.e., we formalize various notions of fairness in machine learning by using this operator and the above proposition and lemma.
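To illustrate, the check w |= ψ0 ∼_x^{ε,D} ψ1 with D = D_tv can be sketched by restricting the dataset twice and comparing the two empirical distributions of x. The dataset encoding and the variable names (a group attribute g and a benefit b) are toy assumptions of ours:

```python
def dist(w, x):
    """sigma_w(x) as a dict mapping each value to its probability in w."""
    n = len(w)
    d = {}
    for s in w:
        d[s[x]] = d.get(s[x], 0.0) + 1.0 / n
    return d

def tv(mu0, mu1):
    """Total variation as half the L1 distance between finite distributions."""
    keys = set(mu0) | set(mu1)
    return 0.5 * sum(abs(mu0.get(k, 0.0) - mu1.get(k, 0.0)) for k in keys)

def indist(w, psi0, psi1, x, eps):
    """w |= psi0 ~_x^{eps,Dtv} psi1: the distributions of x under the two
    restrictions w|psi0 and w|psi1 are within total variation eps."""
    w0 = [s for s in w if psi0(s)]
    w1 = [s for s in w if psi1(s)]
    return tv(dist(w0, x), dist(w1, x)) <= eps

# hypothetical dataset: group attribute g in {0, 1}, benefit b in {0, 1}
w = [{"g": 0, "b": 1}, {"g": 0, "b": 0}, {"g": 1, "b": 1}, {"g": 1, "b": 0}]
print(indist(w, lambda s: s["g"] == 0, lambda s: s["g"] == 1, "b", 0.05))  # True
```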

Summary on the Modal Language
In summary, modal operators are used to represent transformation and testing on datasets. The unary modal operator ∆_T is regarded as a transformation T on datasets, while the binary modal operators ⊃ and ∼_x^{ε,D} are regarded as transforming-then-testing on datasets.

Now the syntax of the formulas is given by:

Static formulas: ψ ::= γ(x1, x2, ..., xn) | ¬ψ | ψ ∧ ψ
Dataset formulas: ϕ ::= P_I ψ | ¬ϕ | ϕ ∧ ϕ | K_a ϕ | ∆_T ϕ | ψ ⊃ ϕ | ψ ∼_x^{ε,D} ψ

where the epistemic formulas with the additional modalities are called dataset formulas, since they are interpreted in a world that corresponds to a dataset. When multiple transformations/tests are sequentially applied to datasets, we can use dataset formulas in which different modal operators are nested. For example, w |= ∆_T(ψ ⊃ ϕ) represents that after applying a data preparation T to a dataset w, a property ϕ holds for the sub-dataset T(w)|ψ consisting of the data satisfying ψ.

Epistemic Model for Supervised Learning
In this section we introduce a formal model for supervised learning. Specifically, we employ a distributional Kripke model (Definition 3), and formalize a behavior of a classifier C and a non-deterministic input x from an adversary in the model. In this formalization, we focus only on the testing of supervised learning models, and do not formalize the training of supervised learning models or learning algorithms themselves.

Classification Problems
Multiclass classification is the problem of classifying a given input into one of multiple classes. Let L be a finite set of class labels, and D be a finite set of input data (called feature vectors) that we want to classify. Then a classifier is a function C : D → L that receives an input datum v and predicts which class (among L) the input v belongs to. In this work, we deal with a situation where some classifier C has already been obtained and its properties should be evaluated; we do not model or reason about how classifiers are trained from a training dataset.
We assume a scoring function f : D × L → R that gives a score f(v, ℓ) of predicting the class of an input datum (feature vector) v as a label ℓ. Then for each input v ∈ D, we write H(v) = ℓ to represent that the label ℓ maximizes f(v, ℓ). For example, when the input v is an image of an animal and ℓ is the animal's name, then H(v) = ℓ may represent that an oracle (or a "human") classifies the image v as ℓ.
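A scoring-based prediction H(v) = argmax_ℓ f(v, ℓ) can be sketched as follows; the particular scoring function here is a toy assumption of ours:

```python
def H(f, v, labels):
    """Oracle prediction: return the label that maximizes the score f(v, label)."""
    return max(labels, key=lambda label: f(v, label))

# toy scoring function: the input v directly carries a score for each label
f = lambda v, label: v.get(label, 0.0)
print(H(f, {"cat": 0.9, "dog": 0.4}, ["cat", "dog"]))  # cat
```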

Modeling the Behaviors of Classifiers
We employ a distributional Kripke model M = (W, (R_a)_{a∈A}, (V_s)_{s∈S}) in which W is an infinite set of possible worlds corresponding to all possible datasets from which the classifier can receive input data. We denote by w_test ∈ W a real world that corresponds to a test dataset. Recall that each world w ∈ W is a multiset of states over S and is associated with a stochastic assignment σ_w : Mes → DO that is consistent with the deterministic assignments σ_s for all s ∈ w, as explained in Sect. 2.4.
We present an overview of our formalization in Fig. 1. We denote by x ∈ Mes an input datum given to the classifier C (and to the oracle H), by y ∈ Mes a correct label given by the oracle H, and by ŷ ∈ Mes a label predicted by C. We assume that the input variable x (resp. the output variables y, ŷ) ranges over the set D of input data (resp. the set L of labels); i.e., the deterministic assignment σ_s at each state s ∈ S has the range O = D ∪ L and satisfies σ_s(x) ∈ D and σ_s(y), σ_s(ŷ) ∈ L.
A key idea in our modeling is that we describe logical aspects of statistical properties at the syntax level by using logical formulas, and model statistical distances and dataset operations at the semantics level by using accessibility relations in the distributional Kripke model. In this way, we can formalize various statistical properties of classifiers in a simple and abstract way.

Fig. 1: A world w is chosen non-deterministically and corresponds to a test dataset. With probability w[s_i], the world w is in a deterministic state s_i where the classifier C receives the input value σ_{s_i}(x) and returns the output value σ_{s_i}(ŷ). Each state s_i can be regarded as a tuple (σ_{s_i}(x), σ_{s_i}(y), σ_{s_i}(ŷ)) ∈ D × L × L consisting of an input datum, an actual label, and a predicted label.
To formalize the classifier C, we introduce a static formula ψ(x, ŷ) to represent that C classifies a given input x as a class ŷ. We also introduce a static formula h(x, y) to represent that y is the actual class of an input x. As an abbreviation, we write ψ_ℓ(x) (resp. h_ℓ(x)) to denote ψ(x, ℓ) (resp. h(x, ℓ)). Formally, these static formulas are interpreted at each state s ∈ S by: s |= ψ(x, ŷ) iff C(σ_s(x)) = σ_s(ŷ), and s |= h(x, y) iff H(σ_s(x)) = σ_s(y).

Modeling the Non-deterministic Inputs from Adversaries
We first observe that a distributional Kripke model M can formalize an input x that is probabilistically chosen from a given dataset. As explained in Sect. 2.4, each world w corresponds to a test dataset. When a state s is drawn from a multiset w of states, an input value σ_s(x) is sampled from the distribution σ_w(x) and assigned to the measurement variable x. The set of all possible probability distributions of inputs is represented by Λ def= {σ_w(x) | w ∈ W}, which is possibly an infinite set. For example, let us consider testing the classifier C with the actual test dataset w_test. When C classifies an input x as a label ℓ with probability 0.2, i.e., Pr[ s ←$ w_test : C(σ_s(x)) = ℓ ] = 0.2, then this can be expressed by: M, w_test |= P_0.2 ψ_ℓ(x).
Next we observe that our model can formalize a non-deterministic input x from an adversary as follows. Although each state s in a possible world w is assigned the probability w[s], each world w itself is not assigned a probability. Thus, each input distribution σ_w(x) ∈ Λ is also not assigned a probability; hence our model assumes no probability distribution over Λ. In other words, we assume that a world w, and thus an input distribution σ_w(x), is chosen non-deterministically. This is useful for modeling an adversary that provides malicious inputs to the classifier C to make its prediction fail, because we usually do not have prior knowledge of the probability distribution of malicious inputs from adversaries, and need to reason about the worst cases caused by the attack. In Sect. 6, this formalization of non-deterministic inputs is used to express the robustness of classifiers.
Finally, it should be noted that we cannot enumerate all possible adversarial inputs, hence cannot enumerate all possible datasets to construct the universe W. Since W can be an infinite set and is unspecified, we cannot check whether a formula expressing a security property against an adversary is satisfied in all possible worlds of W. Nevertheless, as shown in later sections, describing various properties using our extension of StatEL is useful to explore desirable properties and to discuss relationships among them.

Formalizing the Classification Performance
In this section we show a formalization of classification performance using our extension of StatEL. We formalize popular measures of classification performance, including precision, recall, and accuracy, and measures for evaluating overfitting, such as the generalization error. See Fig. 2 for basic ideas on these formalizations.

Classifier's Prediction and its Correctness
In classification problems, the terms positive/negative represent the result of the classifier's prediction, and the terms true/false represent whether the classifier predicts correctly or not. This yields the four standard outcomes: true positive (tp), true negative (tn), false positive (fp), and false negative (fn). These terminologies can be formalized using static formulas as shown in Table 1. For example, when an input x shows a true positive at a state s, this can be expressed as s |= ψ_ℓ(x) ∧ h_ℓ(x). Note that the value of the measurement variable x is uniquely determined by the assignment σ_s at the state s. True negative, false positive (type I error), and false negative (type II error) are respectively expressed as s |= ¬ψ_ℓ(x) ∧ ¬h_ℓ(x), s |= ψ_ℓ(x) ∧ ¬h_ℓ(x), and s |= ¬ψ_ℓ(x) ∧ h_ℓ(x).

Precision, Recall, Accuracy, and Other Performance Measures
Next we formalize three popular measures for binary classification performance: precision, recall, and accuracy. In Table 1 we summarize the formalization of various notions of classification performance using our dataset formulas.
In theory, these notions should be formalized with the infinite dataset w_true representing the true distribution. However, we usually cannot obtain w_true or test the performance measures using w_true. Hence, we often sample a finite test dataset w_test from the true distribution and regard it as an approximation of w_true. (Since the test dataset w_test is finite, there can be missing data that are not included in w_test but are sampled from the true distribution w_true with a very small probability.) Given a test dataset w_test, precision (positive predictive value) is defined as the conditional probability that the prediction is correct given that the prediction is positive; i.e., precision = tp/(tp+fp). Since the probability distribution of the input x in the world w_test is expressed by σ_{w_test}(x) as explained in Sect. 4.3, the precision being within an interval I is given by: which can be written as: By using StatEL, this can be formalized as: Here ⊃ is the conditioning operator defined in Sect. 3.3. The value of precision depends on the test dataset w_test, and can be computed in finite time since w_test is finite. Symmetrically, recall (true positive rate) is defined as the conditional probability that the prediction is correct given that the actual class is positive; i.e., recall = tp/(tp+fn). Then the recall being within I is formalized as: Finally, accuracy is the probability that the classifier predicts correctly; i.e., accuracy = (tp+tn)/(tp+tn+fp+fn). The accuracy being within I is formalized as: which can also be defined as P_I(tp(x) ∨ tn(x)). When we measure the accuracy after a data preparation operation T (e.g., data cleaning) applied to the test dataset w_test, this can be represented by w_test |= ∆_T Accuracy_{ℓ,I}(x).
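Under a finite test dataset, the three measures reduce to ratios of confusion counts; the following sketch (our own helper names, not the paper's notation) computes them and checks the "within interval I" judgement:

```python
def metrics(outcomes):
    """Compute precision, recall, and accuracy from a list of per-state
    outcomes drawn from {"tp", "tn", "fp", "fn"} for a finite test dataset."""
    tp, tn = outcomes.count("tp"), outcomes.count("tn")
    fp, fn = outcomes.count("fp"), outcomes.count("fn")
    precision = tp / (tp + fp)            # P(correct | predicted positive)
    recall = tp / (tp + fn)               # P(correct | actually positive)
    accuracy = (tp + tn) / len(outcomes)  # P(prediction correct)
    return precision, recall, accuracy

def within(value, interval):
    """The judgement that a measure lies in a given interval I = [lo, hi]."""
    lo, hi = interval
    return lo <= value <= hi
```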
Example 1 (Performance of pedestrian detection) Let us consider an autonomous car that uses a machine learning classifier to detect a person crossing the road. For simplicity, we formalize an example of a binary classifier C that detects whether or not a pedestrian is crossing the road in a photo image in a test dataset w_test. We write sunny(x) (resp. snowy(x)) to represent that a photo x was taken on a sunny (resp. snowy) day. Let ψ_ℓ(x) (resp. h_ℓ(x)) represent that the classifier C (resp. the human) detects a pedestrian crossing the road in an image x. We empirically measure recall (i.e., the conditional probability that C detects a pedestrian crossing the road when the input image x actually includes one) by using the data collected on sunny days. When C achieves a recall of 0.95 on sunny days, this is represented by w_test |= sunny(x) ⊃ Recall_{ℓ,0.95}(x).
Since C should detect a pedestrian also on a snow-covered road, it should be tested with data collected on snowy days. If we obtain a recall of 0.8 on snowy days, this is represented by w_test |= snowy(x) ⊃ Recall_{ℓ,0.8}(x).

Generalization Error
We next formalize the generalization error of a classifier, i.e., a measure of how accurately a classifier is able to predict the class of previously unseen input data. Since a classifier has been trained on a finite sampled training dataset w_train, it may be overfitted to w_train and have worse classification performance on new input data that were not included in w_train.
To formalize the generalization error, we introduce a formula λ_L(y, ŷ) representing that, given a correct label y and a predicted label ŷ, the expected value of the losses (i.e., real numbers representing the penalty for incorrect classification) is at most a non-negative real number L. Formally, the semantics of λ_L(y, ŷ) is given by: where loss is a loss function selected according to the data domain O, and a pair (v, v′) of a correct label and a predicted label follows the joint distribution σ_w(y, ŷ). Now the generalization error being L or smaller at a true distribution w_true is written as w_true |= GE_L(x, y, ŷ) where: Since we usually cannot obtain the true distribution w_true and cannot check the satisfaction w_true |= GE_L(x, y, ŷ), we often compute an empirical error (as an approximation of the generalization error) by using a finite test dataset w_test that is believed to be an approximation of w_true. This testing can be expressed as w_test |= GE_L(x, y, ŷ).
On the other hand, given a training dataset w_train, the training error being at most L_train is represented by w_train |= GE_{L_train}(x, y, ŷ). The overfitting of the classifier can then be evaluated by comparing the empirical error L with the training error L_train. When the empirical error is smaller than L_train + ε for some error bound ε > 0, this can be represented by w_test |= GE_{L_train+ε}(x, y, ŷ).
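The comparison between training and empirical error can be sketched as follows, using the 0-1 loss as one possible choice of loss function (function names are ours, not the paper's):

```python
def empirical_loss(pairs, loss):
    """Average loss over (actual, predicted) label pairs: a finite-sample
    stand-in for the expectation over the joint distribution sigma_w(y, yhat)."""
    return sum(loss(y, yhat) for y, yhat in pairs) / len(pairs)

def zero_one(y, yhat):
    """0-1 loss: penalty 1 for an incorrect prediction, 0 otherwise."""
    return 0.0 if y == yhat else 1.0

def overfits(train_pairs, test_pairs, eps, loss=zero_one):
    """True iff the empirical (test) error exceeds the training error by more
    than the tolerated bound eps, i.e. w_test fails GE_{L_train + eps}."""
    return empirical_loss(test_pairs, loss) > empirical_loss(train_pairs, loss) + eps
```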

Fig. 3: Robustness compares the conditional probability in the test dataset w_test with that in another possible world w′ that is close to w_test in terms of R^{ε,W_d}_x. Note that an adversary's choice of the input distribution σ_{w′}(x) is formalized as a non-deterministic choice of the possible world w′.

Formalizing the Robustness of Classifiers
Many recent studies have found attacks on machine learning where a malicious adversary manipulates the input to cause a malfunction in a machine learning task [12]. Such input data, called adversarial examples [40], are designed to make a classifier fail to predict the actual class of the input, while still being recognized as belonging to that class by human eyes. In computer vision, for example, Goodfellow et al. [20] create an adversarial example by adding imperceptible noise to a panda's photo so that humans still recognize the perturbed image as a panda, but a classifier misclassifies it as a gibbon. To prevent or mitigate such attacks, the classifier should be robust against perturbed inputs, i.e., it should return similar predicted labels given similar input data.
In this section we formalize robustness notions for classifiers by using epistemic operators in StatEL (See Fig. 3 for an overview of the formalization). Furthermore, we show certain relationships between classification performance and robustness, and suggest a class of robustness properties that have not been formalized in the literature as far as we know. We present an overview of these formalizations and relationships in Fig. 4.

Total Correctness of Classifiers
We first note that the total correctness of classifiers could be formalized as a classification performance (e.g., precision, recall, or accuracy) in the presence of all possible inputs from adversaries. For example, the total correctness could be formalized as M |= Recall_{ℓ,I}(x), which represents that Recall_{ℓ,I}(x) is satisfied in all possible worlds of M.
In practice, however, it is not possible or tractable to test whether the classification performance is achieved for all possible test datasets (corresponding to an infinite number of possible worlds in M). Hence we need a weaker form of a correctness notion, which may be verified or tested in some way. In the following sections, we deal with robustness notions that are weaker than total correctness.

Accessibility Relation for Robustness
To formalize robustness notions, we introduce an accessibility relation R^{ε,W_d}_x that relates two worlds having close inputs, as follows.

Definition 8 (Accessibility relation for robustness) We define an accessibility relation
where W_d is the ∞-Wasserstein distance w.r.t. a metric d in Definition 2.
Then (w, w′) ∈ R^{ε,W_d}_x represents that the two distributions σ_w(x) and σ_{w′}(x) of inputs to the classifier C are close in terms of the distance W_d. Intuitively, for example, W_d measures the distance between two image datasets σ_w(x) and σ_{w′}(x) when the distance between individual images is measured by a metric d.
An epistemic formula K^{ε,W_d} ϕ then represents that we are confident that ϕ is true even when the input data are perturbed by noise of level ε or smaller.
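For intuition, when the two worlds carry equally sized empirical distributions and we fix the coupling that pairs each datum with its perturbed counterpart, the ∞-Wasserstein distance is bounded above by the maximum per-datum distance. The following sketch (our simplification — the true W_d optimises over all couplings) therefore gives only a sufficient check for accessibility:

```python
def w_inf_upper_bound(xs, ys, d):
    """Upper bound on the ∞-Wasserstein distance between two equally sized
    empirical distributions, using the identity coupling that pairs each
    datum xs[i] with its perturbed counterpart ys[i]."""
    assert len(xs) == len(ys)
    return max(d(x, y) for x, y in zip(xs, ys))

def accessible(xs, ys, d, eps):
    """Sufficient condition for (w, w') ∈ R^{eps,Wd}_x: every datum has
    moved by at most eps under the metric d."""
    return w_inf_upper_bound(xs, ys, d) <= eps
```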

Probabilistic Robustness against Targeted Attacks
When a robustness attack aims at misclassifying an input as a specific target label ℓ̂_tar, it is called a targeted attack. For instance, in the above-mentioned attack by [20], gibbon is the target label into which a panda's photo is misclassified.
In this section, we discuss how to formalize robustness using the epistemic operator K^{ε,W_d}. We denote by v ∈ D an original input image in the test dataset w_test, and by v′ ∈ D an image obtained by perturbing the original image v with noise.
A first definition of robustness against targeted attacks might be: which represents that when an image v′ is obtained by perturbing a panda's photo v with noise, it will never be classified as the target label gibbon. This can be formalized using StatEL by: However, this notion does not tolerate even a negligible probability of misclassification, and it does not cover the case where the human cannot recognize the perturbed image v′ as a panda (e.g., when the perturbed image v′ is obtained by linear displacement, rescaling, or rotation [2], H(v′) = panda may fail to hold).
To overcome these issues, we introduce the following definition, which tolerates some conditional probability δ of misclassification.
Definition 9 (Targeted robustness) Let δ ∈ [0, 1]. Given a dataset w_test, a classifier C satisfies probabilistic targeted robustness w.r.t. an actual label ℓ and a target label ℓ̂_tar if for any input v ∈ supp(σ_{w_test}(x)) from the dataset w_test, and for any perturbed input v′ ∈ D s.t. d(v, v′) ≤ ε, we have: For instance, when the actual class ℓ is panda and the target label ℓ̂_tar is gibbon, the classifier C misclassifies a panda's photo as a gibbon with probability at most δ. Now we express this robustness notion with I = [1 − δ, 1] by using StatEL.

Proposition 2 (Targeted robustness)
The probabilistic targeted robustness w.r.t. an actual label ℓ and a target label ℓ̂_tar under a given test dataset w_test is expressed by w_test |= TRobust_{ℓ,ℓ̂_tar,I}(x) where: Proof Let w′ be a possible world such that (w_test, w′) ∈ R^{ε,W_d}_x. Then w′ corresponds to a dataset obtained by perturbing each datum in w_test. Let v′ ∈ supp(σ_{w′}(x)). Then the claim follows from the semantics of K^{ε,W_d}.
Since the L_p-distances are often regarded as reasonable approximations of human perceptual distances [10], they are used as distance constraints on the perturbation in many studies of targeted attacks (e.g., [40,20,10]). Our model can represent robustness against these attacks by using an L_p-distance as the metric d in R^{ε,W_d}_x.
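Definition 9 can also be tested empirically on a finite dataset by sampling perturbations within the ε-ball and estimating the misclassification probability. A minimal sketch, where `classify` and `perturb` are hypothetical stand-ins for the classifier C and the adversary's noise:

```python
import random

def targeted_robust(data, classify, perturb, target, eps, delta, trials=200):
    """Empirical check of probabilistic targeted robustness: for every input v
    in the dataset, sample perturbed inputs v' with d(v, v') <= eps (produced
    by `perturb`) and require that the estimated probability of the classifier
    outputting the target label stays at most delta."""
    for v in data:
        hits = sum(classify(perturb(v, eps)) == target for _ in range(trials))
        if hits / trials > delta:
            return False
    return True
```

For example, with a threshold classifier over the reals and uniform noise, inputs far from the decision boundary pass the check, while inputs within ε of the boundary fail it.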

Probabilistic Robustness against Non-Targeted Attacks
In this section we formalize non-targeted attacks [33,32], in which adversaries try to make inputs be misclassified as arbitrary incorrect labels (i.e., not as a specific label such as gibbon). Compared to targeted attacks, this kind of attack is easier to mount but harder to defend against.
We first define the notion of robustness against non-targeted attacks as follows.
Definition 10 (Non-targeted robustness) Let δ ∈ [0, 1]. Given a dataset w_test, a classifier C satisfies probabilistic non-targeted robustness w.r.t. an actual label ℓ if for any input v ∈ supp(σ_{w_test}(x)) from the dataset w_test, and for any perturbed input v′ ∈ D s.t. d(v, v′) ≤ ε, we have:

Proof The proof is analogous to that of Proposition 2.

(The L_p-distance between n-dimensional real vectors x and x′ is written ‖x − x′‖_p, where the p-norm is defined by ‖v‖_p = (Σ^n_{i=1} |v_i|^p)^{1/p}.)

Relationships among Robustness Notions
In this section we present relationships among notions of robustness and performance, and discuss properties related to robustness. We first present the following proposition immediate from the definitions.

Proposition 4 (Relationships among notions)
Let I ⊆ [0, 1] and ℓ, ℓ̂_tar ∈ L. Then we have: The first claim means that probabilistic non-targeted robustness is at least as strong as probabilistic targeted robustness for the same I. The second claim means that probabilistic non-targeted robustness implies recall without perturbation noise. Note that the latter is immediate from the reflexivity of R^{ε,W_d}_x.
Next we remark that our extension of StatEL can be used to describe a situation in which adversarial attacks are mitigated. When we apply a mechanism T that preprocesses a given input to mitigate attacks on robustness, the probabilistic targeted robustness is expressed as w_test |= ∆_T TRobust_{ℓ,ℓ̂_tar,I}(x), where ∆_T is the modality for the dataset transformation T.
Finally, we recall that by Proposition 3, robustness can be regarded as recall in the presence of perturbation noise. This implies that for each property ϕ in the table of confusion (Table 1), we could consider K^{ε,W_d} ϕ as a property that evaluates the classification performance in the presence of adversarial inputs, although this has not been formalized in the literature on the robustness of machine learning as far as we know. For example, precision robustness K^{ε,W_d} Precision_{ℓ,I}(x) represents that, in the presence of perturbation noise, the prediction is correct with a probability in I given that it is positive. For another example, accuracy robustness K^{ε,W_d} Accuracy_{ℓ,I}(x) represents that, in the presence of perturbation noise, the prediction is correct (whether it is positive or negative) with a probability in I.

Example 2 (Robustness of pedestrian detection)
We illustrate robustness notions using the pedestrian detection in Example 1 in Section 5.2. We deal with a binary classifier C that detects whether a pedestrian is crossing the road in a photo image x.
The non-targeted robustness K^{ε,W_d} Recall_{ℓ,0.9}(x) represents that, in the presence of perturbation noise on the input image x, the classifier C detects a person crossing the road with probability 0.9 whenever the human can actually recognize one. This robustness is crucial for an autonomous car not to hit a pedestrian.
The precision robustness K^{ε,W_d} Precision_{ℓ,0.9}(x) represents that, in the presence of perturbation noise on x, with probability 0.9 the human can actually recognize a person crossing the road when the classifier C detects one. This type of robustness is important for an autonomous car to avoid stopping suddenly due to a false alarm (and thus being hit by the car behind).

Formalizing the Fairness of Classifiers
Many studies have proposed and investigated various notions of fairness in machine learning [5]. Informally, these fairness notions mean that the results of machine learning tasks are independent of sensitive attributes, e.g., gender, age, race, disease, or political/religious views. In recent years, there have been studies on testing methods for the fairness of machine learning [18,1,42].
In this section, we formalize popular notions of fairness of supervised learning by using our extension of StatEL. Here we focus on fairness that should be maintained in the impact (i.e., the results of machine learning tasks) rather than in the treatment (i.e., the process of machine learning tasks). This is because previous research shows that many seemingly neutral features have statistical relationships with sensitive attributes; hence just ignoring or removing sensitive attributes in the process of data preparation and training is often ineffective or even harmful to the fairness and performance of learning tasks.

Basic Ideas and Notations
Various notions of fairness in supervised learning are classified into three categories: independence, separation, and sufficiency [5]. All of these have the form of (conditional) independence or a relaxation thereof, and thus can be formalized using the modal operator ∼^{ε,D}_x for conditional indistinguishability (defined in Sect. 3.4) in our extension of StatEL. In the formalization of fairness notions, recall that x, y, and ŷ are measurement variables denoting, respectively, the input datum, the actual class label (given by the oracle H), and the predicted label (output by the classifier C). Given a real world w_test (corresponding to a given test dataset), σ_{w_test}(x) is the probability distribution of C's test input over D, σ_{w_test}(y) is the distribution of the actual label over L, and σ_{w_test}(ŷ) is the distribution of C's output over L.
Fairness notions are usually defined in terms of a sensitive attribute (e.g., gender, age, race, disease, or political/religious view), which is defined as a tuple of subsets of the input data domain D. For example, a sensitive attribute based on age can be defined as a pair of groups G_0 (input data with ages 21 to 60) and G_1 (ages 61 to 100). For each group G ⊆ D of inputs, we introduce a static formula η_G(x) representing that an input x belongs to G. Formally, this is interpreted by: Roughly speaking, a machine learning task is said to be fair if the performance of the task for a group G_0's input is similar to that for another group G_1's input. (Some fairness notions, e.g., equal opportunity, assume G_1 = D \ G_0.) In the following sections, we formalize the three categories of fairness of classifiers and their relaxations. A summary of this formalization is presented in Table 2.

8 Such unawareness requires that sensitive attributes not be explicitly used in the learning process. However, StatEL may not be suited to formalizing this requirement.

9 Compared to the preliminary version [28] of this paper, we corrected errors and changed the formalization into a more comprehensible form by introducing the operator ∼^{ε,D}_x and by removing the counterfactual epistemic operators and a formula ξ_d representing that the input is drawn from a dataset d.

Independence (a.k.a. Group Fairness, Statistical Parity) and its Relaxation
In this section we explain and formalize the notion of independence [9], which is also known as group fairness [15], and its relaxed notion. Intuitively, independence means that the predicted label ŷ has no statistical relationship with membership in a sensitive group. For example, independence does not allow a bank's lending rate to be correlated with a sensitive attribute such as gender.
We first present the definition of a relaxed notion of independence, called group fairness up to bias ε [15]. Intuitively, this is the property that the output distributions of the classifier are roughly identical when input data belong to different groups.
Formally, this fairness notion is defined as follows.
Definition 11 (Independence, group fairness) Let G_0, G_1 ⊆ D be sets of input data constituting a sensitive attribute. For each b = 0, 1, let μ_{G_b} ∈ 𝔻L be the probability distribution of the predicted label ℓ̂ output by a classifier C when an input v is sampled from a test dataset w_test and belongs to G_b; i.e., for each ℓ̂ ∈ L, Then a classifier C satisfies group fairness between the groups G_0 and G_1 up to bias ε if D_tv(μ_{G_0} ‖ μ_{G_1}) ≤ ε, where D_tv is the total variation between distributions (defined in Sect. 2.2). A classifier C satisfies independence w.r.t. groups G_0 and G_1 if it satisfies group fairness between G_0 and G_1 up to bias 0. Now we express this fairness notion using our extension of StatEL as follows.
Proposition 5 (Independence, group fairness) The group fairness between groups G_0 and G_1 up to bias ε under a given test dataset w_test is expressed as w_test |= GrpFair_ε(x, ŷ) where: Independence (i.e., bias ε = 0) is expressed by w_test |= GrpFair_0(x, ŷ).

11 In previous literature, independence has also been referred to by different terminologies, such as statistical parity, demographic parity, and disparate impact.
Example 3 (Independence in pedestrian detection) We illustrate independence using the pedestrian detection of Example 1 in Section 5.2. We deal with a binary classifier C that detects whether or not a pedestrian is crossing the road in an image x. We write η_m(x) (resp. η_w(x)) to represent that an image x includes a man (resp. a woman) who may or may not be crossing the road. Let ψ(x, ŷ) represent that, given an input image x, the classifier C returns ŷ (which is either the detection of a person crossing the road or not).
Then the independence between men and women, GrpFair_0(x, ŷ) def= (η_m(x) ∧ ψ(x, ŷ)) ∼^{0,D_tv}_ŷ (η_w(x) ∧ ψ(x, ŷ)), means that the probability of detecting a pedestrian crossing the road is the same for men and women. This fairness guarantees that men and women are equally detectable as pedestrians, and hence equally safe with respect to an autonomous car. Note that independence does not rely on the actual label y, i.e., on whether there is a pedestrian crossing the road who can be detected by human eyes.
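Under a finite test dataset, group fairness up to bias ε reduces to comparing the empirical output distributions of the two groups by total variation; a sketch (helper names are ours):

```python
from collections import Counter

def label_dist(outputs):
    """Empirical distribution of predicted labels for one group."""
    n = len(outputs)
    return {label: c / n for label, c in Counter(outputs).items()}

def total_variation(p, q):
    """Total variation distance D_tv between two label distributions,
    given as dicts mapping label -> probability."""
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in set(p) | set(q))

def group_fair(outputs_g0, outputs_g1, eps):
    """Group fairness up to bias eps: D_tv(mu_G0, mu_G1) <= eps."""
    return total_variation(label_dist(outputs_g0), label_dist(outputs_g1)) <= eps
```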

Separation (a.k.a. Equalized Odds) and its Relaxation (Equal Opportunity)
In this section we explain and formalize the notion of separation [5], which is well known as equalized odds [22], and its relaxed notion called equal opportunity [22]. The motivation behind these notions is to capture typical scenarios in which sensitive characteristics may have statistical relationships with the actual class label. For instance, even when some sensitive attribute is correlated with the actual default rate on loans, banks might want to apply different lending rates to people who have a higher default rate. However, independence (group fairness) does not allow this, since it requires that the lending rate be statistically independent of the sensitive attribute.

12 In previous literature, separation has also been referred to as disparate mistreatment [46] and conditional procedure accuracy equality [6].
To overcome this problem, the notion of separation allows statistical relationships between a sensitive attribute and the predicted label ŷ output by the classifier C, to the extent that they are justified by the actual class label y. More precisely, separation means that the predicted label ŷ is conditionally independent of membership in a sensitive group, given the actual class label y.
Formally, separation is defined as the property that recall (true positive rate) and specificity (true negative rate; explained in Table 1) are the same for all groups, and equal opportunity is defined as the special case of separation restricted to an advantageous class label.

Definition 12 (Separation & equal opportunity)
Given a group G_b ⊆ D and an actual class label ℓ, let μ_{G_b,ℓ} ∈ 𝔻L be the probability distribution of the predicted label ℓ̂ output by a classifier C when an input v ∈ G_b is sampled from a test dataset w_test and is associated with the actual label ℓ; i.e., for each ℓ̂ ∈ L, A classifier C satisfies separation between two groups G_0 and G_1 if μ_{G_0,ℓ} = μ_{G_1,ℓ} holds for all ℓ ∈ L. A classifier C satisfies equal opportunity of an advantageous label ℓ w.r.t. a group G_0 if μ_{G_0,ℓ} = μ_{G_1,ℓ} holds for G_1 = D \ G_0.
Now we express these two notions using our extension of StatEL as follows.
It should be noted that for ε > 0, EqOdds_ε(x, ŷ) represents a relaxation of separation up to bias ε in terms of the total variation D_tv.
The equal opportunity of a label ℓ w.r.t. a group G_0 under a given test dataset w_test is expressed as w_test |= EqOpp(x, ŷ) where: Proof The proof of this proposition is similar to that of Proposition 6. Let G_1 = D \ G_0. By μ_{G_b,ℓ} = σ_{w_{b,ℓ}}(ŷ), the equal opportunity of ℓ w.r.t. G_0 is given by σ_{w_{0,ℓ}}(ŷ) = σ_{w_{1,ℓ}}(ŷ). Therefore, this proposition follows from Proposition 1.
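Separation (with equal opportunity as its single-label special case) can likewise be checked empirically by conditioning on each actual label before comparing groups. A minimal sketch with our own helper names:

```python
from collections import Counter

def dist(labels):
    """Empirical distribution over a list of labels (empty -> empty dict)."""
    n = len(labels)
    return {l: c / n for l, c in Counter(labels).items()} if n else {}

def tv(p, q):
    """Total variation distance between two label distributions."""
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in set(p) | set(q))

def equalized_odds(pairs_g0, pairs_g1, label_set, eps):
    """Separation up to bias eps: for every actual label, the conditional
    distributions of the predicted label agree across groups within eps.
    pairs_g* are (actual, predicted) samples for each group."""
    for y in label_set:
        p0 = dist([yh for ya, yh in pairs_g0 if ya == y])
        p1 = dist([yh for ya, yh in pairs_g1 if ya == y])
        if tv(p0, p1) > eps:
            return False
    return True
```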
Example 4 (Separation in pedestrian detection) We illustrate separation using the pedestrian detection of Example 3, where a binary classifier C detects whether a pedestrian is crossing the road in an image x. Let ψ(x, ŷ) (resp. h(x, y)) represent that, given an image x, the classifier C (resp. the human) returns ŷ (resp. y) representing either detection or not.
The inherent technical difficulty of detecting a female pedestrian may differ from that of detecting a male pedestrian, because, for example, physical appearance may tend to differ between women and men. If we take this possible difference into account, separation can be more suitable than independence.
The separation EqOdds_0(x, ŷ) between men and women guarantees that the conditional probability of detecting a pedestrian crossing the road, given that the human can actually recognize one, is the same for men and women. This fairness implies that (from the viewpoint of a pedestrian crossing the road) male and female pedestrians are exposed to the risk of being hit by an autonomous car as fairly as by a human-driven car.

Sufficiency (a.k.a. Conditional Use Accuracy Equality)
In this section we explain and formalize the notion of sufficiency [5], which is also known as conditional use accuracy equality [6].
While separation guarantees the equality of recall among different groups, sufficiency requires the equality of precision. More precisely, sufficiency is defined as the property that precision (positive predictive value) and negative predictive value (NPV in Table 1) are the same for all groups, as follows.
Definition 13 (Sufficiency) Given a group G_b ⊆ D and a predicted label ℓ̂, let μ_{G_b,ℓ̂} ∈ 𝔻L be the probability distribution of the actual class label ℓ when an input v ∈ G_b is sampled from a test dataset w_test and the classifier C outputs the predicted label ℓ̂; i.e., for each ℓ ∈ L, A classifier C satisfies sufficiency between two groups G_0 and G_1 if μ_{G_0,ℓ̂} = μ_{G_1,ℓ̂} holds for all ℓ̂ ∈ L.
Then this notion can be expressed using our extension of StatEL as follows.
It should be noted that for ε > 0, Sufficiency_ε(x, y) represents a relaxation of sufficiency up to bias ε in terms of the total variation D_tv.
Example 5 (Sufficiency in pedestrian detection) We illustrate sufficiency using the pedestrian detection in Example 3 where a classifier C detects whether a pedestrian is crossing the road in an image x. As mentioned in Example 4, the level of the inherent technical difficulty of detecting a male pedestrian may be different from that of a female pedestrian. Whereas separation guarantees the equality of recall between men and women, sufficiency guarantees that of precision.
The sufficiency Sufficiency_0(x, y) between men and women implies that the conditional probability that there is no pedestrian crossing the road, given that C detects one, is the same for men and women. From the viewpoint of the car's driver, when C raises a false alarm and stops the car suddenly, there is no bias regarding whether men or women are more likely to trigger false alarms and to be blamed for them.
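Dually to separation, sufficiency conditions on the predicted label and compares the distributions of the actual label; a sketch under the same conventions (helper names are ours):

```python
from collections import Counter

def dist(labels):
    """Empirical distribution over a list of labels (empty -> empty dict)."""
    n = len(labels)
    return {l: c / n for l, c in Counter(labels).items()} if n else {}

def tv(p, q):
    """Total variation distance between two label distributions."""
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in set(p) | set(q))

def sufficiency(pairs_g0, pairs_g1, label_set, eps):
    """Sufficiency up to bias eps: conditioned on each predicted label, the
    distributions of the ACTUAL label agree across the two groups within eps.
    pairs_g* are (actual, predicted) samples for each group."""
    for yhat in label_set:
        p0 = dist([ya for ya, yh in pairs_g0 if yh == yhat])
        p1 = dist([ya for ya, yh in pairs_g1 if yh == yhat])
        if tv(p0, p1) > eps:
            return False
    return True
```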

Related Work
In this section, we provide a brief overview of related work on the specification of statistical machine learning and on epistemic logic for describing specification.
Desirable properties of statistical machine learning. There have been a large number of papers on attacks and defences for deep neural networks [40,12]. Compared to them, however, not much work has been done to explore the formal specification of various properties of machine learning. Seshia et al. [38] present a list of desirable properties of DNNs (deep neural networks) although most of the properties are presented informally without mathematical formulas. As for robustness, Dreossi et al. [13] propose a unifying formalization of adversarial input generation in a rigorous and organized manner, although they formalize and classify attacks (as optimization problems) rather than define the robustness notions themselves.
Concerning fairness notions, Barocas et al. [5] survey various fairness notions and classify them into three categories: independence, separation, and sufficiency. Gajane [17] surveys the formalization of fairness notions for machine learning and presents some justification based on the social science literature.
Epistemic logic for describing specification. Epistemic logic [44] has been studied to represent and reason about knowledge and belief [16,21], and has been applied to describe various properties of distributed systems.
The BAN logic [8], proposed by Burrows, Abadi and Needham, is a notable example of epistemic logic used to model and verify the authentication in cryptographic protocols. To improve the formalization of protocols' behaviors, some epistemic approaches integrate process calculi [24,11].
Epistemic logic has also been used to formalize and reason about privacy properties, including anonymity [39,19,29], receipt-freeness of electronic voting protocols [25], and privacy policy for social network services [34]. Temporal epistemic logic is used to express information flow security policies [3].
Concerning the formalization of fairness notions, previous work in formal methods has modeled different kinds of fairness involving timing by using temporal logic rather than epistemic logic. As far as we know, no previous work has formalized fairness notions of machine learning by using modal logic.
Formalization of statistical properties. In studies of philosophical logic, Lewis [31] presents the idea that when a random value has various possible probability distributions, those distributions should be represented on distinct possible worlds. Bana [4] puts Lewis's idea in a mathematically rigorous setting. Recently, a modal logic called statistical epistemic logic (StatEL) [27] was proposed and used to formalize statistical hypothesis testing and the notion of differential privacy [14].
To describe statistical properties of machine learning models, this work uses StatEL to formalize the probabilistically chosen input to a learning model and the non-deterministically chosen dataset. However, we could possibly employ other logics (e.g., fuzzy logic [45] or Markov logic network [37]) by extending them to deal with statistical sampling and non-deterministic inputs. Exploring the possibility of different formalization using other logics is left for future work.

Conclusion
In this paper we proposed an epistemic approach to the modeling of supervised learning and its desirable properties. Specifically, we employed a distributional Kripke model in which each possible world corresponds to a possible dataset and modal operators are interpreted as transformation and testing on datasets. Then we formalized various notions of the classification performance, robustness, and fairness of statistical classifiers by using our extension of statistical epistemic logic (StatEL). In this formalization, we clarified relationships among properties of classifiers, and relevance between classification performance and robustness.
We emphasize that this is the first attempt to use epistemic models and logical formulas to describe statistical properties of machine learning, and would be a starting point to develop theories of formal specification of machine learning.
In future work, we are planning to extend our framework to formally reason about system-level properties of learning-based systems. We are also interested in developing a more general framework for the formal specification of machine learning associated with testing methods, as well as in implementing a prototype tool. Our future work will also include an extension of StatEL to formalize unsupervised learning and reinforcement learning.