1 Introduction

1.1 A Brief Overview of Affine Deligne–Lusztig Varieties

The concept of Affine Deligne–Lusztig Varieties (ADLV) was first introduced by Rapoport [51]. These varieties serve as a group-theoretic model for the reduction of Shimura varieties and shtukas with parahoric level structure and play a vital role in arithmetic geometry and the Langlands program. Key problems associated with ADLV include:

  • Non-emptiness pattern;

  • Dimension;

  • Enumeration of irreducible components.

Over the past two decades, the study of ADLV has been a vibrant research topic. Significant progress has been made in understanding fundamental problems, and important applications to number theory and the Langlands program have been discovered. The non-emptiness pattern and dimension of ADLV in the affine Grassmannian are now fully understood, and in most cases, they are also known for ADLV in the affine flag variety. The enumeration of irreducible components has been solved in the affine Grassmannian case. For more in-depth information, readers are referred to the survey article [27].

Despite these advancements, a comprehensive solution to the problems presented by ADLV remains a challenge, primarily due to the difficulty in finding explicit patterns.

In this paper, our focus is on ADLV \(X_w(b)\) in the affine flag variety (associated with the Iwahori level structure). Information for other parahoric level structures can be obtained from the Iwahori level structure via the natural projection map. The ADLV \(X_w(b)\) depends on two parameters: the element w in the Iwahori–Weyl group \({{\tilde{W}}}\) of a loop group \(\breve{G}\), and the Frobenius-twisted conjugacy class [b] of \(\breve{G}\). We consider the map from the pair (w, b) to the dimension and enumeration of the top-dimensional irreducible components of \(X_w(b)\) (refer to Sect. 2.1 for a precise statement of the problem). The image of a given pair (w, b) can be calculated using the intricate inductive algorithm established in [25]. Yet, the goal is to derive more explicit formulas/information for this map. Such explicit formulas and information are particularly intriguing because of their broad applications to arithmetic geometry and number theory.

1.2 Machine Learning Assisting Pure Mathematics Research

In recent years, machine learning (ML), particularly deep learning, has exerted a profound impact on diverse scientific and engineering disciplines, bringing about substantial changes in the way we conduct research. The success of ML primarily hinges on advances in the design and training of deep neural networks, which approximate complex, high-dimensional mappings with good accuracy and can be evaluated rapidly. As a result, ML has become a leading force in different areas of artificial intelligence (AI), such as natural language processing (large language models such as ChatGPT [68]), computer vision (e.g., NeRF [44], diffusion models [32]) and games (e.g., mastering the game of Go [57], Poker [6], and StarCraft II [64]). Moreover, the excellent approximation capabilities of deep neural networks have helped discover new patterns or principles within large, multi-dimensional data sets. This has significantly expanded ML’s role in natural science, contributing to the rise of a new field known as “AI for Science”. Successful examples include the work of AlphaFold [34], molecular dynamics simulations [11, 33, 67], chemical discovery [43, 59, 65], system identification [8, 40, 41, 50], controllable nuclear fusion [14, 35], etc.

More recently, Geordie Williamson and DeepMind used ML to assist in research-level explorations of pure mathematics [13]. They presented an ML-based framework that augments mathematicians’ intuition, aiding in the discovery and understanding of complex mathematical relationships. This approach identifies potential correlations between two mathematical entities by deriving a function that approximates the relationship and helping mathematicians analyze it.

The framework validates possible patterns in mathematical objects using supervised learning, and helps understand these patterns using attribution techniques. In the supervised learning stage, a hypothesis about a connection between two entities is proposed, a dataset is generated, and a function is trained to predict one entity from the other. The role of ML here is in learning a wide variety of potential non-linear functions given sufficient data. Attribution techniques are then used to understand the trained function and propose a potential relationship. One such technique, gradient saliency, calculates the derivative of function outputs with respect to inputs, helping to identify the most relevant problem aspects. This process may be iterative until a feasible conjecture is found.

In essence, this ML-guided framework enables a quick verification of the potential worthiness of an intuition about a relationship and, if validated, suggests how the two entities may be related. This framework has already proven its utility in [13], where it led to significant results, such as uncovering the first relationships between algebraic and geometric invariants in knot theory and conjecturing a resolution to the combinatorial invariance conjecture for symmetric groups.

1.3 Our Objective

In this study, our objective is to develop an ML-assisted framework to guide the study of fundamental problems related to ADLV, specifically the non-emptiness pattern, the dimension and enumeration of irreducible components. As illustrated in Figs. 1 and 2, our framework showcases a recursive pipeline of data generation, model training, pattern analysis, and human examination. Despite similarities with the framework in the aforementioned study [13], several crucial differences exist. Our data-generation process is more intricate, particularly regarding the selection of a meaningful subset of pairs (w, b) and an appropriate set of features. Additionally, after fitting a functional relationship between the feature set and the property of interest (e.g. the non-emptiness of \(X_w(b)\)), the patterns revealed by salience analysis may be more challenging for mathematicians to interpret due to the problem’s complexity. Nevertheless, we found that this interaction between ML and human mathematicians significantly accelerates pure mathematical research, enabling us to identify new conjectures and promising research directions that could otherwise take years for mathematicians to discover by themselves.

We provide the source code for computing geometric invariants of ADLV and machine learning models to invite interested readers to delve into this problem, reproduce our experiments, and refine our approach by studying different datasets and feature sets. In Sect. 5, we seek a linear approximation for the dimension of ADLV, leading us to rediscover the virtual dimension formula. Originally, the development of the virtual dimension formula required several years of intense research by many mathematicians and constitutes a major milestone in the field. In Sect. 6, we conduct sensitivity analysis for the problems introduced in Sect. 1.1. We identify several important features, affirming some of the latest research results in the field and suggesting potential next steps. Motivated by the experiments in Sects. 5 and 6, we discover a new problem concerning a certain lower bound of the dimension, which has not been studied in the literature before. In Sect. 7, we provide a full mathematical proof of this lower bound. We conclude this paper by suggesting future work in Sect. 8, sharing some experiences, and highlighting lessons learned from the collaboration.

2 Preliminaries

2.1 Definition and Properties of Affine Deligne–Lusztig Varieties

In this subsection, we provide a brief overview of affine Deligne–Lusztig varieties. Except in Sects. 4.1 and 7, we focus on the case of the special linear group \({\textrm{SL}}_n\), also referred to as type \(A_{n-1}\). By focusing exclusively on this case, we may reduce the technical details in this exposition. However, the more important reason for this specialization is that it allows us to perform computer experiments within a reasonably narrow scope, where all the beauties and pathologies of the general case are still present.

Let q be a prime power and \({\mathbb {F}} _q\) be the finite field with q elements. We define \(F = {\mathbb {F}} _q{(\!({t})\!)}\) to be the field of formal Laurent series over \({\mathbb {F}} _q\). This means that elements \(a\in F\) are formal power series

$$\begin{aligned} a = \sum _{i\in {\mathbb {Z}}} a_i t^i \end{aligned}$$

with coefficients \(a_i\in {\mathbb {F}} _q\), such that \(a_i=0\) for almost all \(i<0\). There is no notion of convergence involved, but the definition of addition and multiplication in F mimics the behavior of absolutely convergent power series over real or complex numbers.

Pick once and for all an algebraic closure \({{\overline{{\mathbb {F}}}}}_q\) and define \(\breve{F} = {{\overline{{\mathbb {F}}}}}_q{(\!({t})\!)}\) to be the field of formal Laurent series over \({{\overline{{\mathbb {F}}}}}_q\). The Galois group of the field extension \(\breve{F}/F\) is generated by the Frobenius \(\sigma \), which can be evaluated for elements in \(\breve{F}\) as

$$\begin{aligned} \sigma \left( \sum _{i\in {\mathbb {Z}}} a_i t^i\right) = \sum _{i\in {\mathbb {Z}}}a^{q}_i t^i\in \breve{F}. \end{aligned}$$

Finally, we write \({\mathcal {O}}_{\breve{F}} = {{\overline{{\mathbb {F}}}}}_q{[\![{t}]\!]}\) for the ring of all formal power series, i.e., elements \(a\in \breve{F}\) with \(a_i=0\) for all \(i<0\).

Throughout this paper till Sect. 7, we will focus on the algebraic group scheme \({\textrm{SL}}_n\), considered as a scheme over \({\mathcal {O}}_F = {\mathbb {F}} _q{[\![{t}]\!]}\). We get an induced map \(\sigma : {\textrm{SL}}_n(\breve{F})\rightarrow {\textrm{SL}}_n(\breve{F})\), given by applying the above Frobenius \(\sigma :\breve{F}\rightarrow \breve{F}\) to the entries of each \(n\times n\)-matrix in \({\textrm{SL}}_n(\breve{F})\).

Two elements \(b,c\in {\textrm{SL}}_n(\breve{F})\) are called \(\sigma \)-conjugate if there exists some \(g\in {\textrm{SL}}_n(\breve{F})\) with

$$\begin{aligned} b = g^{-1} c \sigma (g). \end{aligned}$$

One checks that this is an equivalence relation, similar to ordinary conjugacy. We denote the \(\sigma \)-conjugacy class of \(b\in {\textrm{SL}}_n(\breve{F})\) by [b], and the set of \(\sigma \)-conjugacy classes of \({\textrm{SL}}_n(\breve{F})\) by \(B({\textrm{SL}}_n)\). The \(\sigma \)-conjugacy class of \(b\in {\textrm{SL}}_n(\breve{F})\) is uniquely determined by an invariant called the Newton point of b, denoted by \(\nu _b\in {\mathbb {Q}}^n\) [37].

Call a vector \((\nu _1,\dotsc ,\nu _n)\in {\mathbb {Q}}^n\) dominant if \(\nu _1\geqslant \cdots \geqslant \nu _n\). Then, the Newton point of each \(b\in {\textrm{SL}}_n(\breve{F})\) is such a dominant vector. We have an action of the symmetric group \(S_n\) on \({\mathbb {Q}}^n\) by permutation of coordinates. One checks that each orbit under this action contains precisely one dominant vector. If \(b\in {\textrm{SL}}_n(\breve{F})\) is a diagonal matrix of the form \(b = {{\,\textrm{diag}\,}}(\pm t^{b_1},\dotsc ,\pm t^{b_n})\) with \(b_1,\dotsc ,b_n\in {\mathbb {Z}}\), then the Newton point of b is the unique dominant element in the \(S_n\)-orbit of \((b_1,\dotsc ,b_n)\in {\mathbb {Z}}^n\subseteq {\mathbb {Q}}^n\).

One calls \(W_0:= S_n\) the (finite) Weyl group of \({\textrm{SL}}_n\). The affine Weyl group is given by the semidirect product

$$\begin{aligned} W_a = {{\tilde{W}}} := S_n\ltimes \{(\mu _1,\dotsc ,\mu _n)\in {\mathbb {Z}}^n\mid \mu _1+\cdots + \mu _n=0\}. \end{aligned}$$

We write elements \(w\in W_a\) also as \(w = t^\lambda z\), where \(z\in S_n\) and \(\lambda =(\lambda _1,\dotsc ,\lambda _n)\in {\mathbb {Z}}^n\) with \(\lambda _1+\cdots +\lambda _n=0\). The symbol t is a formal variable reminding us of the uniformizer \(t\in F\).

We choose for each permutation \(x \in S_n\) a representative \(\dot{x} \in {\textrm{SL}}_n(\breve{F})\), so that

$$\begin{aligned} \dot{x}_{i,j} = {\left\{ \begin{array}{ll} \pm 1,&{}\quad i=x(j),\\ 0,&{}\quad i \ne x(j). \end{array}\right. } \end{aligned}$$

Concretely, the element \(\dot{x}\) may be chosen to be the permutation matrix of x if x is an even permutation, and the permutation matrix of x with one sign flipped if x is an odd permutation. For \(w = t^\lambda z\in W_a\), we write \(\dot{w}\in {\textrm{SL}}_n(\breve{F})\) for the element \(\dot{w} = {{\,\textrm{diag}\,}}(t^{\lambda _1},\dotsc ,t^{\lambda _n}) \dot{z}\).

Let \(\breve{I}\subset {\textrm{SL}}_n(\breve{F})\) be the Iwahori subgroup, i.e., the subgroup of all matrices in \({\textrm{SL}}_n({\mathcal {O}}_{\breve{F}})\) whose reduction modulo t is upper triangular.

Then, each element \(g\in {\textrm{SL}}_n(\breve{F})\) has the form \(g = i_1 \dot{w} i_2\) for a uniquely determined element \(w\in W_a\) and non-unique \(i_1, i_2\in \breve{I}\) (see [7]). Such a decomposition can be computed using an adaptation of the Gaussian elimination algorithm.

We have seen two decompositions of the set \({\textrm{SL}}_n(\breve{F})\), namely, one into \(\sigma \)-conjugacy classes \(B({\textrm{SL}}_n)\) and another one into Iwahori double cosets \(\breve{I}\dot{w} \breve{I}\) for \(w\in W_a\). We also write \(\breve{I} w\breve{I}:= \breve{I}\dot{w} \breve{I}\), since the double coset is independent of the choice of \(\dot{w}\).

The right coset space \(Fl={\textrm{SL}}_n(\breve{F})/\breve{I}\) is called the affine flag variety. It is an ind-scheme over \({{\overline{{\mathbb {F}}}}}_q\), which behaves similarly to finite-dimensional varieties over that field. For \(w\in W_a\) and \(b\in {\textrm{SL}}_n(\breve{F})\), we define the affine Deligne–Lusztig variety to be the subvariety

$$\begin{aligned} X_w(b) := \{g \breve{I} \in Fl\mid g^{-1} b \sigma (g)\in \breve{I}w\breve{I}\}. \end{aligned}$$

Again, \(X_w(b)\) is not actually a variety, but still a scheme over \({{\overline{{\mathbb {F}}}}}_q\). Moreover, each irreducible component of \(X_w(b)\) is an actual finite-dimensional variety over \({{\overline{{\mathbb {F}}}}}_q\). If b is \(\sigma \)-conjugate to \(c\in {\textrm{SL}}_n(\breve{F})\), say \(c = h^{-1} b \sigma (h)\) for \(h\in {\textrm{SL}}_n(\breve{F})\), then

$$\begin{aligned} X_w(b)\rightarrow X_w(c),\quad g\breve{I}\mapsto h^{-1}g\breve{I} \end{aligned}$$

is an isomorphism. Thus, we may associate the isomorphism type of \(X_w(b)\) to the pair \((w,[b])\in W_a\times B({\textrm{SL}}_n)\). This is our main object of interest.

We see that the \(\sigma \)-centralizer of b, denoted by

$$\begin{aligned} {\textbf{J}}_b(F) = \{g\in {\textrm{SL}}_n(\breve{F})\mid g^{-1}b\sigma (g) = b\}, \end{aligned}$$

acts on \(X_w(b)\) by left multiplication. Up to that action, there are only finitely many irreducible components in \(X_w(b)\). Since each such irreducible component is a finite-dimensional variety over \({{\overline{{\mathbb {F}}}}}_q\), the entire affine Deligne–Lusztig variety \(X_w(b)\) is always finite-dimensional.

The main questions regarding the geometry of affine Deligne–Lusztig varieties are the following three:

  • Non-emptiness pattern: given (w, [b]), determine whether \(X_w(b) \ne \emptyset \);

  • Dimension: given (w, [b]) such that \(X_w(b) \ne \emptyset \), calculate the dimension of \(X_w(b)\);

  • Irreducible components: given (w, [b]) such that \(X_w(b) \ne \emptyset \), calculate the number of \({\textbf{J}}_b(F)\)-orbits of top dimensional irreducible components, i.e., the cardinality of \({\textbf{J}}_b(F) \backslash \Sigma ^{\textrm{top}}(X_w(b))\). Here, \(\Sigma ^{\textrm{top}}(X)\) denotes the set of top-dimensional irreducible components of X.

2.2 Important Invariants

To address these three questions, one explores the relationship between the affine Weyl group \(W_a\) and the set \(B({\textrm{SL}}_n)\) parametrized by the Newton points. This relationship is essentially combinatorial in nature, which allows us to compute the answers to the three above questions for any given pair \((w,\nu _b)\). We summarize some key combinatorial invariants that have proven valuable in previous works.

The group \(W_0 = S_n\) is known to be a Coxeter group with respect to the generators \({\mathbb {S}} = \{s_1,\dotsc ,s_{n-1}\}\). Here, \(s_i\) is the simple reflection interchanging i with \(i+1\) and leaving everything else fixed. Each element \(x\in W_0\) is a product of the \(s_i\), and the shortest length of such an expression is called the length of x. There is a unique element of maximal length in \(W_0\), denoted by \(w_0\). It is the permutation \(w_0(i) = n+1-i\) for \(i\in \{1,\dotsc ,n\}\), and its length is \(\ell (w_0) = n(n-1)/2\).

There is a different way to compute the length of an element \(x\in W_0\). Denote the set of roots of \({\textrm{SL}}_n\) as

$$\begin{aligned} \Phi = \{e_i - e_j\in {\mathbb {Q}}^n\mid i,j\in \{1,\dotsc ,n\}\text { and }i\ne j\}. \end{aligned}$$

The root \(\alpha _{i,j}:=e_i-e_j\) is called positive if \(i<j\), and negative otherwise. We write \(\delta (\alpha ) = 1\) if \(\alpha \) is a negative root and 0 if \(\alpha \) is positive. Observe that the \(S_n\)-action on \({\mathbb {Q}}^n\) preserves the set of roots. Then, the length of \(x\in W_0\) equals the number of positive roots \(\alpha _{i,j}\), such that \(x\alpha _{i,j}\) is a negative root. As a formula,

$$\begin{aligned} \ell (x) = \sum _{i<j}\delta (x\alpha _{i,j}). \end{aligned}$$
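To make this concrete, the following short Python sketch (an illustration of ours, not part of the program described in Sect. 4; permutations are assumed to be given in one-line notation) computes \(\ell (x)\) by counting the positive roots sent to negative roots, i.e., the inversions of x.

def perm_length(x):
    """Length of a permutation in one-line notation, e.g. x = [2, 1, 3]."""
    n = len(x)
    # x sends alpha_{i,j} = e_i - e_j to e_{x(i)} - e_{x(j)}, which is negative iff x(i) > x(j)
    return sum(1 for i in range(n) for j in range(i + 1, n) if x[i] > x[j])

# Example: the longest element w_0 of S_4 has length 4*3/2 = 6.
assert perm_length([4, 3, 2, 1]) == 6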

The group \(W_a = S_n\ltimes {\mathbb {Z}}^n\) is also known to be a Coxeter group with respect to \({\mathbb {S}} _a\). The set of simple affine reflections \({\mathbb {S}} _a\) is given by \(\{s_0\}\cup {\mathbb {S}}\), where

$$\begin{aligned} s_0 = (1~n)t^{(-1,0,\dotsc ,0,1)}\in W_a. \end{aligned}$$

One defines the length of an element \(w\in W_a\) as above, namely as the length of a shortest expression in these simple affine reflections. There is an alternative way to compute the length of \(w = t^{\lambda } z\in W_a\): We saw above that there is some \(y\in W_0\) with \(y z^{-1}\lambda \in {\mathbb {Z}}^n\) being dominant. Among all such elements \(y\in W_0\), there is a unique one with \(\ell (y)\) being minimal. For this specific \(y\in W_0\), we write \(x:= zy^{-1}\in W_0\) and \(\mu := y z^{-1}\lambda \in {\mathbb {Z}}^n\), so that \(w = xt^{\mu } y\). Then

$$\begin{aligned} \ell (xt^\mu y) = \langle \mu ,2\rho \rangle + \ell (x)-\ell (y). \end{aligned}$$

Here, \(\langle \cdot ,\cdot \rangle \) is the standard Euclidean inner product on \({\mathbb {Q}}^n\), and \(2\rho = (n-1,n-3,\dotsc ,3-n,1-n)\in {\mathbb {Q}}^n\) is the sum of positive roots. Whenever we write w in the form \(xt^\mu y\), we always assume that \(x,\mu ,y\) have been chosen as above.
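As an illustration (again not part of our actual program), the following sketch evaluates this length formula, assuming w is already given in the decomposition \(w = xt^\mu y\) described above, with x and y in one-line notation and \(\mu \) an integer vector.

def affine_length(x, mu, y):
    """ell(x t^mu y) = <mu, 2*rho> + ell(x) - ell(y), with x, mu, y chosen as above."""
    n = len(mu)
    inv = lambda p: sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
    two_rho = [n - 1 - 2 * i for i in range(n)]        # (n-1, n-3, ..., 1-n)
    return sum(m * r for m, r in zip(mu, two_rho)) + inv(x) - inv(y)

# For a dominant translation, x = y = identity and ell(t^mu) = <mu, 2*rho>:
assert affine_length([1, 2, 3], [2, 0, -2], [1, 2, 3]) == 8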

For \(c\in {\mathbb {Z}}_{\geqslant 0}\), we say that the element \(w = xt^\mu y\) is c-regular if \(\langle \mu ,\alpha _{i,j}\rangle \geqslant c\) for all positive roots \(\alpha _{i,j}\). The decomposition of w into x, \(t^\mu \), and y has the most desirable properties whenever w is 2-regular, but the above length formula is always true even when such a regularity condition is not satisfied.

To each \(w\in W_a\), one may associate the \(\sigma \)-conjugacy class \([\dot{w}]\in B({\textrm{SL}}_n)\). Its Newton point can be computed as follows: If w has the form \(w = t^\lambda \) for some \(\lambda \in {\mathbb {Z}}^n\) with \(\lambda _1+\cdots +\lambda _n=0\), then \(\dot{w} = {{\,\textrm{diag}\,}}(t^{\lambda _1},\dotsc ,t^{\lambda _n})\) and we saw above that the Newton point of \(\dot{w}\) is the unique dominant element in the \(S_n\)-orbit of \(\lambda \). For general \(w\in W_a\), one may find an integer \(m\geqslant 1\), such that \(w^m\) is of the above form, and then, the Newton point of \(\dot{w}\) is given by \(\nu _w = \nu _{w^m}/m\in {\mathbb {Q}}^n\).
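The following Python sketch (illustrative only; the function and variable names are our own) computes \(\nu _w\) from the pair \((\lambda ,z)\) with \(w = t^\lambda z\) by summing the vectors \(z^k\lambda \) until \(z^m\) is the identity and then taking the dominant representative.

from fractions import Fraction

def act(z, v):
    """Coordinate-permutation action: (z.v)_{z(i)} = v_i, with z in one-line notation."""
    out = [0] * len(v)
    for i, zi in enumerate(z):
        out[zi - 1] = v[i]
    return out

def newton_point(lam, z):
    """Newton point of w = t^lam z: translation part of w^m divided by m, made dominant."""
    n = len(z)
    identity = list(range(1, n + 1))
    total, v, m, zm = list(lam), list(lam), 1, list(z)
    while zm != identity:
        v = act(z, v)                           # the factor z^m(lam) picked up by w^(m+1)
        total = [a + b for a, b in zip(total, v)]
        zm = [z[a - 1] for a in zm]             # zm now represents z^(m+1)
        m += 1
    return sorted((Fraction(t, m) for t in total), reverse=True)

# Pure translation: the dominant rearrangement of lam.
assert newton_point([0, -1, 1], [1, 2, 3]) == [1, 0, -1]
# The simple affine reflection s_0 = t^(1,0,-1)(1 3) in the SL_3 case has Newton point 0.
assert newton_point([1, 0, -1], [3, 2, 1]) == [0, 0, 0]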

It turns out that each \(\sigma \)-conjugacy class \([b]\in B({\textrm{SL}}_n)\) contains the representative \(\dot{w}\in {\textrm{SL}}_n(\breve{F})\) of some \(w\in W_a\), cf. [25, Theorem 3.7]. Hence, the above method yields all Newton points of all \(\sigma \)-conjugacy classes. If \([b]\in B({\textrm{SL}}_n)\) has Newton point \(\nu _b = (\nu _1,\dotsc ,\nu _n)\in {\mathbb {Q}}^n\), we define the best integral approximation \(\lfloor \nu _b\rfloor \in {\mathbb {Z}}^n\) to be the vector \((\mu _1,\dotsc ,\mu _n)\in {\mathbb {Z}}^n\), such that for all i, we have

$$\begin{aligned} \lfloor \nu _1+\cdots + \nu _i\rfloor = \mu _1+\cdots + \mu _i\in {\mathbb {Z}}. \end{aligned}$$

Equivalently, \(\lfloor \nu _b\rfloor \) is the unique vector in \({\mathbb {Z}}^n\) that can be written in the form

$$\begin{aligned} \lfloor \nu _b\rfloor = \nu _b - c_1 \alpha _{1,2}-\cdots - c_{n-1}\alpha _{n-1,n}, \end{aligned}$$

such that \(0\le c_i<1\) for \(i=1,\dotsc ,n-1\). The defect of \([b]\in B({\textrm{SL}}_n)\) can then be defined as \({\textrm{def}}(b) = \langle \nu _b-\lfloor \nu _b\rfloor ,2\rho \rangle \). Alternatively, it can be computed as the number of non-integral partial sums of \(\nu _b\), i.e., the number of indices \(i\in \{1,\dotsc ,n\}\), such that \(\nu _1+\cdots +\nu _i\in {\mathbb {Q}}{\setminus }{\mathbb {Z}}\).
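For illustration, the following sketch (ours, not part of the program of Sect. 4) computes \(\lfloor \nu _b\rfloor \) via the floors of the partial sums and the defect via the pairing \(\langle \nu _b-\lfloor \nu _b\rfloor ,2\rho \rangle \); exact rational arithmetic is used to avoid rounding issues.

from fractions import Fraction
from math import floor

def best_integral_approximation(nu):
    """The integral vector whose partial sums are the floors of the partial sums of nu."""
    partial, floors = Fraction(0), [0]
    for x in nu:
        partial += Fraction(x)
        floors.append(floor(partial))
    return [floors[i + 1] - floors[i] for i in range(len(nu))]

def defect(nu):
    """def(b) = <nu_b - floor(nu_b), 2*rho>."""
    n = len(nu)
    two_rho = [n - 1 - 2 * i for i in range(n)]
    fl = best_integral_approximation(nu)
    return sum((Fraction(x) - m) * r for x, m, r in zip(nu, fl, two_rho))

nu = [Fraction(1, 2), Fraction(1, 2), Fraction(-1)]
assert best_integral_approximation(nu) == [0, 1, -1]
assert defect(nu) == 1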

2.3 Computing the Geometry of Affine Deligne–Lusztig Varieties

In this section, we present a combinatorial algorithm that efficiently computes the answers to the above three main questions for a given pair \((w,\nu _b)\), where \(w\in W_a\) and \(\nu _b\in {\mathbb {Q}}^n\). Although the subsequent sections of this article do not depend on the specific algorithm used, we want to provide a detailed explanation of its workings.

It is worth noting that the algorithm is somewhat complex and non-deterministic, which accounts for why the three main questions are still considered open. While the algorithm yields a computational solution to each of the three problems, one may still seek a more straightforward and satisfactory characterization. Nonetheless, the algorithm allows for effective computation of the desired results, which is crucial for our practical applications.

Let \(w\in W_a\). We explain an algorithm that computes, in finite time, the set

$$\begin{aligned} \{\nu _b\mid [b]\in B({\textrm{SL}}_n)\text { and } X_w(b)\ne \emptyset \}\subseteq {\mathbb {Q}}^n \end{aligned}$$

(which hence must be finite). For each occurring Newton point, the corresponding affine Deligne–Lusztig variety is uniquely determined up to isomorphism, and our algorithm computes the dimension of this variety and the number of \({\textbf{J}}_b(F)\)-orbits of its top-dimensional irreducible components.

For a simple affine reflection \(s\in {\mathbb {S}} _a\), we say that \(sws\in W_a\) is a cyclic shift of w if \(\ell (sws)\leqslant \ell (w)\). Under this condition, \(\ell (sws)\) can either be equal to \(\ell (w)\) or \(\ell (w)-2\).

In the first case, i.e., \(\ell (sws) = \ell (w)\), the affine Deligne–Lusztig varieties \(X_w(b)\) and \(X_{sws}(b)\) are always isomorphic for all \([b]\in B({\textrm{SL}}_n)\). Therefore, to compute the above data for w, we may freely pass between w and sws.

In the second case, \(\ell (sws) = \ell (w)-2\), each affine Deligne–Lusztig variety \(X_w(b)\) splits into two parts, so \(X_w(b) = U\sqcup V\) is the disjoint union of two subsets, with U being open in \(X_w(b)\) and V closed. Then, there are surjective maps with irreducible one-dimensional fibers

$$\begin{aligned} U\twoheadrightarrow X_{ws}(b),\qquad V\twoheadrightarrow X_{sws}(b). \end{aligned}$$

Hence, \(U\ne \emptyset \) if and only if \(X_{ws}(b)\ne \emptyset \). In this case, \(\dim U = \dim X_{ws}(b)+1\). The set U is \({\textbf{J}}_b(F)\)-invariant, and the number of \({\textbf{J}}_b(F)\)-orbits of top dimensional irreducible components agrees for U and \(X_{ws}(b)\). The same story happens for V and \(X_{sws}(b)\). Once we know this geometric information, the corresponding data for \(X_w(b) = U\sqcup V\) are easily computed: If \(U=V=\emptyset \), then \(X_w(b)=\emptyset \). If precisely one of the subsets U or V is empty and the other one is non-empty, then \(X_w(b)\) agrees with the unique non-empty subset, and all geometric invariants are known. Finally, if \(U\ne \emptyset \ne V\), we have \(\dim X_w(b) =\max (\dim U,\dim V)\) and

$$\begin{aligned} \# {\textbf{J}}_b(F){\setminus } \Sigma ^{\text {top}}(X_w(b))&={\left\{ \begin{array}{ll} \#{\textbf{J}}_b(F){\setminus } \Sigma ^{\text {top}}(U),&{}\quad \dim U>\dim V,\\ \#{\textbf{J}}_b(F){\setminus } \Sigma ^{\text {top}}(V),&{}\quad \dim V>\dim U,\\ \#{\textbf{J}}_b(F){\setminus } \Sigma ^{\text {top}}(U)+ \#{\textbf{J}}_b(F){\setminus } \Sigma ^{\text {top}}(V),&{}\quad \dim U=\dim V. \end{array}\right. } \end{aligned}$$

Moreover, if \(\ell (sws)=\ell (w)-2\), then \(\ell (ws)=\ell (w)-1\). Therefore, in this case, we have reduced the geometric questions for w and arbitrary [b] to the same questions for the two elements \(ws\) and \(sws\) of smaller length.

The first part of the algorithm iteratively enumerates the cyclic shift class of w, i.e., the set of all elements in \(W_a\) reachable by iterated cyclic shifts \(w\rightarrow s_1 w s_1\rightarrow s_2 s_1 w s_1 s_2\rightarrow \cdots \). Each element in this cyclic shift class has length \(\leqslant \ell (w)\), so that the cyclic shift class is a finite set.

We traverse this cyclic shift class, until we either exhaust the full set or reach some element \(w'\) in the cyclic shift class and some \(s'\in {\mathbb {S}} _a\) with \(\ell (w) = \ell (w') = \ell (s' w' s')+2\). In the latter case, we know \(X_w(b)\cong X_{w'}(b)\) and the geometric properties of \(X_{w'}(b)\) can be reduced to recursively calling our algorithm for the two smaller elements \(s' w'\) and \(s' w' s'\). Since the length drops by at least one, such a recursive call can only happen a finite number of times.

The second part of the algorithm handles the case where the entire cyclic shift class is enumerated without reaching a length-reducing cyclic shift. In this case, we not only know that w has a minimal length in this cyclic shift class but even that it must have a minimal length in its conjugacy class in \(W_a\).

In this case, we can explicitly state that

$$\begin{aligned} \{\nu _b\mid X_w(b)\ne \emptyset \} = \{\nu _w\}. \end{aligned}$$

For \(\nu _b = \nu _w\), we know \(\dim X_w(b) = \ell (w) - \langle \nu _b,2\rho \rangle \) and that the number of \({\textbf{J}}_b(F)\)-orbits of top dimensional irreducible components is equal to 1.

This algorithm is guaranteed to terminate in finite time with the correct result. We note that the algorithm itself is non-deterministic, since there is no canonical way to traverse the cyclic shift class of an element \(w\in W_a\). While the above method allows us to compute the dimension of \(X_w(b)\) for arbitrary w, [b], it is far from being a closed formula. In many cases, however, one may expect that such a closed formula can be found.

The first part of this algorithm is due to Görtz–He [18, Corollary 2.5.3], the second part is due to He and Nie [29, Theorem A], [25, Theorem 4.8].
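The following Python-style sketch mirrors this reduction. It is not our actual implementation: the group-theoretic primitives (multiplication and length in \(W_a\), the list of simple affine reflections, Newton points, and the pairing with \(2\rho \)) are assumed to be supplied by a helper object W, elements of \(W_a\) are assumed to be hashable values, and only the reduction logic described above is spelled out.

def adlv_invariants(w, W):
    """Return {newton_point: (dimension, number of J_b(F)-orbits of top-dimensional
    irreducible components)} for X_w(b), over all [b] with X_w(b) nonempty.
    W is an assumed helper object supplying W.simple_affine_reflections, W.mult(u, v),
    W.length(u), W.newton_point(u) (as a hashable tuple), and W.pair_with_2rho(nu)."""
    # Part 1: traverse the cyclic shift class of w.
    seen, frontier = {w}, [w]
    while frontier:
        u = frontier.pop()
        for s in W.simple_affine_reflections:
            sus = W.mult(s, W.mult(u, s))
            if W.length(sus) == W.length(u) - 2:
                # X_w(b) is isomorphic to X_u(b), the disjoint union of U and V, with
                # one-dimensional fibrations onto X_{us}(b) and X_{sus}(b); both pieces
                # gain one dimension, and the counts of top-dimensional component orbits
                # add up when the dimensions agree.
                merged = {}
                for part in (adlv_invariants(W.mult(u, s), W), adlv_invariants(sus, W)):
                    for nu, (d, c) in part.items():
                        d += 1
                        if nu not in merged or d > merged[nu][0]:
                            merged[nu] = (d, c)
                        elif d == merged[nu][0]:
                            merged[nu] = (d, merged[nu][1] + c)
                return merged
            if W.length(sus) == W.length(u) and sus not in seen:
                seen.add(sus)
                frontier.append(sus)
    # Part 2: w has minimal length in its conjugacy class.
    nu = W.newton_point(w)
    return {nu: (W.length(w) - W.pair_with_2rho(nu), 1)}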

2.4 Machine Learning-Assisted Formula Exploration

As mentioned earlier, the algorithm used to compute the geometry of affine Deligne–Lusztig varieties is somewhat complex, non-deterministic, and implicit. However, practical research often requires an explicit expression or pattern. Machine learning, particularly deep neural network models, excels at fitting and analyzing high-dimensional mappings. Thus, we are considering utilizing machine learning to aid us in exploring formulas.

Fig. 1: Pipeline of machine learning-assisted formula exploration

The machine learning-assisted formula exploration process is broken down into several steps, as illustrated in Fig. 1. First, a suitable problem \(Y=f(X)\) needs to be selected, where f is a mapping from known features X to the variable Y of interest, and a dataset \(\{X_i,Y_i\}_{i=1}^N\) is generated using a known algorithm. Second, a machine learning model \({\hat{f}}_\theta \) is chosen, and an optimal approximation \({\hat{f}}_{\theta ^*}\) of f is attained through optimization on the generated dataset. Third, hints about the patterns of f are obtained by analyzing the explicit expression of \({\hat{f}}_{\theta ^*}\). Finally, the obtained pattern can inspire us to introduce new features X or consider functions f with different domains. From this, we can modify f(X), return to the first step, and continue this cycle. Each step will be elaborated on in detail in the following.

Regarding f, our primary concern is the mapping from (w, [b]) to the geometry of affine Deligne–Lusztig varieties \(X_w(b)\), including the dimension, whether \(X_w(b) \ne \emptyset \), and the number of irreducible components. However, due to the inherent complexity of this mapping, it may be necessary to use computed features of w and b as X, such as \(\ell (w)\) and \({\textrm{def}}(b)\).

The selection of a suitable machine learning model \({\hat{f}}_{\theta }\) requires careful consideration of the inherent complexity of f. Generally speaking, complex models have greater expressive power, but they may be more difficult to optimize and may require larger datasets. In this article, we focus on several commonly used models, including linear models [1, 48], Support Vector Machine (SVM) [9, 12], and neural networks [52, 54]. The specific formulation of the models will be described in detail in the subsequent section.

Given a dataset \({\mathcal {D}}=\{X_i,Y_i\}_{i=1}^N\) and the model \({\hat{f}}_{\theta }\), the next step is to identify the optimal value for \(\theta \) that minimizes the distance between \({\hat{f}}_{\theta }\) and the target function f (known as training) and evaluate the performance of \({\hat{f}}_{\theta }\) (known as testing). Typically, the dataset \({\mathcal {D}}\) is partitioned into two distinct subsets: the training set \({\mathcal {D}}_{tr}\) and the testing set \({\mathcal {D}}_{te}\). The former is used to obtain the optimal \(\theta ^*\), while the latter is used to evaluate \({\hat{f}}_{\theta ^*}\).

The training process seeks to minimize the following loss function:

$$\begin{aligned} \theta ^*=\mathop {\mathrm {arg\,min}}_\theta \sum _{(X_i,Y_i)\in {\mathcal {D}}_{tr}} {\mathcal {L}}(Y_i,{\hat{f}}_{\theta }(X_i))+\lambda {\mathcal {R}}(\theta ), \end{aligned}$$

where \({\mathcal {L}}\) is a distance metric function, such as cross-entropy for classification problems or \(L_2\) distance for regression problems. The regularization term for the parameters, \({\mathcal {R}}(\theta )\), depends on prior knowledge about the parameters and is typically expressed using the \(L_1\) and \(L_2\) norm to ensure simplicity or sparsity of the expression and prevent overfitting [2, 42]. The parameter \(\lambda \) controls the strength of regularization, with a larger value indicating a greater emphasis on the simplicity or sparsity of the model parameters depending on the specific choice of \({\mathcal {R}}(\theta )\).

After obtaining \(\theta ^*\), we typically calculate the loss function and accuracy on the testing set. If the model \({\hat{f}}_{\theta ^*}\) exhibits relatively low loss and high accuracy on both the training and testing sets, we can consider it as a good approximation of the target function f. This indicates that the model has successfully generalized from the training set to unseen data and is likely to perform well on new data.

Once we obtain \({\hat{f}}_{\theta ^*}\), we analyze the patterns of f by examining \({\hat{f}}_{\theta ^*}\). First, the complexity of f can be analyzed by observing the accuracy of \({\hat{f}}_{\theta ^*}\) under different hyperparameters, such as the number of hidden neurons and layers in the neural network. This allows for a rough estimation of f’s complexity. Second, we can determine the sensitivity of f to different features by taking the derivative of \({\hat{f}}_{\theta ^*}\), enabling the determination of the significance of these features. Third, if the form of \({\hat{f}}_{\theta ^*}\) is relatively simple or becomes simple after sparse optimization, an approximate explicit expression of f can be directly obtained. Finally, error analysis of \({\hat{f}}_{\theta ^*}\) also facilitates the understanding of f’s properties, such as differences in complexity in varying regions.

In cases where a suitable pattern cannot be obtained, this could indicate an improper selection of our mapping. For instance, if the number of features X is insufficient or if f is too complex, it may be challenging to find a simple \({\hat{f}}_\theta \) that can approximate the function and reveal its underlying patterns. In such instances, we need to modify f(X) based on the insights gained from the previous round of exploration, such as by adding specific features or considering the properties of f on specific domains, and continue with the next round of exploration. This process continues iteratively.

3 Fundamental Concepts of Machine Learning and Associated Caveats

Machine learning is a field that employs computational models to learn patterns in data. This section will provide an overview of some basic machine learning models and discuss several crucial caveats in employing these models.

3.1 Machine Learning Models

3.1.1 Linear Models

Linear models are perhaps the simplest type of machine learning model, and they make a good starting point for the study of machine learning algorithms. They model the relationship between the input features and the output as a linear combination of the input features.

Suppose we have \(p\) input features. A linear model is then a hyperplane, given by the equation

$$\begin{aligned} {\hat{Y}}= {\hat{f}}_{\theta }(X) = \beta ^\top X-b = \beta _{(1)} X_{(1)} + \beta _{(2)}X_{(2)} +\cdots + \beta _{(p)} X_{(p)}-b. \end{aligned}$$

Here, \([X_{(1)}, X_{(2)},\ldots , X_{(p)}]=X\in {\mathbb {R}}^p\) are the input features, \({\hat{Y}}\) is the output, and \(\{[\beta _{(1)},\ldots ,\beta _{(p)}]=\beta \in {\mathbb {R}}^p, b\}=\theta \) are the parameters of the model. The parameters are typically learned from the data using the method of least squares, which minimizes the sum of the squared residuals, i.e., the differences between the observed data \(Y_i\) and the predicted outputs \({\hat{Y}}_i\):

$$\begin{aligned} \min _{\beta , b} \sum _{i=1}^{N} (Y_i - \beta ^\top X_i +b )^2. \end{aligned}$$

Despite their simplicity, linear models can be quite effective in practice, particularly when the underlying relationship is linear or nearly so.
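As a minimal illustration (synthetic data, not our dataset), the least-squares fit can be carried out with numpy as follows; the intercept is recovered by appending a constant feature.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # 200 samples, p = 3 features
Y = X @ np.array([2.0, -1.0, 0.5]) - 4.0 + 0.01 * rng.normal(size=200)

A = np.hstack([X, np.ones((200, 1))])               # constant column recovers the intercept
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
beta, b = coef[:3], -coef[3]                        # the model is beta^T X - b, so b = -intercept
print(beta.round(2), round(b, 2))                   # approximately [2. -1. 0.5] and 4.0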

3.1.2 Support Vector Machines (SVM)

Support Vector Machines (SVMs) are a set of supervised learning methods particularly well suited for classification of complex but small- or medium-sized datasets.

The search for the optimal hyperplane is formulated as an optimization problem. The aim is to minimize the norm of the weight vector, \(\Vert \beta \Vert ^2\), which corresponds to maximizing the margin. This is subject to a set of constraints ensuring that each data point \(X_i\) is correctly classified with respect to the hyperplane

$$\begin{aligned} Y_i(\beta ^\top X_i - b) \ge 1, \quad \forall i. \end{aligned}$$

In this constraint, \(Y_i\) represents the class label of the data point \(X_i\), with a value of either \(+1\) or \(-1\). The expression \(Y_i(\beta ^\top X_i - b)\) should be greater than or equal to 1 for all \(i\), indicating that each data point is on the correct side of the margin or, at the very least, on the boundary of the margin.

Therefore, the optimization problem can be described more precisely as

$$\begin{aligned} \min _{\beta , b} \frac{1}{2} \Vert \beta \Vert ^2 \quad \text {subject to} \quad Y_i(\beta ^\top X_i - b) \geqslant 1, \quad \forall i. \end{aligned}$$

This formulation ensures that the hyperplane has the maximum possible margin between the classes, while achieving accurate classification of the data points.

For non-linearly separable data, SVMs utilize the “kernel trick”, a method to map the input data into a higher dimensional space where it can be linearly separable. Different kernel functions can be used depending on the nature of the data, such as polynomial kernels and radial basis function (RBF) kernels. Despite their mathematical complexity, SVMs have a geometric interpretation and can be intuitively understood as trying to find the “widest possible street” that separates the different classes.
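A minimal illustration with scikit-learn on synthetic, linearly separable data (not our dataset) looks as follows; the RBF-kernel variant shows the kernel trick mentioned above. Note that scikit-learn's decision function is \(\beta ^\top X + \text {intercept}\), so \(b = -\text {intercept}\) in the notation of this section.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # labels in {+1, -1}, linearly separable

linear_svm = SVC(kernel="linear", C=1.0).fit(X, Y)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, Y)        # kernel trick for non-linear boundaries
print(linear_svm.score(X, Y), rbf_svm.score(X, Y))
print(linear_svm.coef_[0], -linear_svm.intercept_[0])   # beta and b in the notation above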

3.1.3 Neural Network (NN) Regression

Neural networks are flexible function approximators that use layers of neurons to model complex relationships. In a regression context, a neural network learns a mapping from inputs to a continuous output. Each neuron applies a series of transformations, first a linear transformation and then a non-linear activation function, to its input.

Consider a fully connected network [47] with a total of \(n_L\) layers, each containing \(n_H\) neurons, where the input is \(X^{(0)}=X\) and the output is \({\hat{Y}}={\hat{f}}(X)=X^{(n_L)}\). The neuron \(j\) in layer \(l\) can be represented as

$$\begin{aligned} X_{(j)}^{(l)} = \rho \left( \sum _{i=1}^{n_H} w_{ji}^{(l)} X_{(i)}^{(l-1)} + b_{j}^{(l)}\right) , \end{aligned}$$

where \(w_{ji}^{(l)}\) and \(b_j^{(l)}\) are the weights and bias for neuron \(j\) in layer \(l\), \(X_{(i)}^{(l-1)}\) are the outputs of the neurons in the previous layer, and \(\rho \) is the activation function. Common choices for \(\rho \) include the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU) functions. The network’s weights and biases are learned by minimizing a loss function, often the mean squared error for regression tasks, using an algorithm such as gradient descent or one of its variants. In the interest of generality, it is common in machine learning to omit the bias term and represent the function directly in matrix form

$$\begin{aligned} {\hat{f}}(X)=\rho (W_{n_L}\cdots W_2\rho (W_1X)\cdots ), \end{aligned}$$

This representation is equivalent to the one with explicit bias terms: it can be interpreted as appending an additional dimension, set to 1, to the feature vector, transforming the expression \(\sum _{i=1}^{n} w_{ji} X_{(i)} + b_{j}\) into \(\sum _{i=1}^{n+1} w_{ji} X_{(i)}\), with \(w_{j(n+1)}\) equal to \(b_{j}\) and \(X_{(n+1)}=1\). For clarity, unless otherwise specified, all neural networks discussed in this paper include the bias term.
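A minimal PyTorch sketch of such a regression network (hyperparameters and data are illustrative, not those used in our experiments) is given below; weight decay plays the role of the \(L_2\) regularization term discussed in Sect. 2.4.

import torch
import torch.nn as nn

n_in, n_H, n_L = 4, 32, 3                           # illustrative sizes
layers = [nn.Linear(n_in, n_H), nn.ReLU()]
for _ in range(n_L - 2):
    layers += [nn.Linear(n_H, n_H), nn.ReLU()]
model = nn.Sequential(*layers, nn.Linear(n_H, 1))   # n_L linear layers in total

X = torch.randn(256, n_in)
Y = (X ** 2).sum(dim=1, keepdim=True)               # a toy non-linear target

opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 via weight decay
for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), Y)      # mean squared error
    loss.backward()
    opt.step()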

3.1.4 Neural Network (NN) Classification

In a classification context, a neural network learns to classify inputs into discrete categories. The architecture is similar to that of a regression network, but the final layer typically has as many neurons as there are classes, and it applies a softmax function to produce a probability distribution p over the classes.

The output of the \(j\)th neuron in the softmax layer, denoted by \(p_{(j)}\), can be represented as

$$\begin{aligned} p_{(j)} = \frac{e^{X_{(j)}^{(n_L)}}}{\sum _{k=1}^{K} e^{X_{(k)}^{(n_L)}}}, \end{aligned}$$

where \(X_{(j)}^{(n_L)}\) is the input to the \(j\)th neuron in the softmax layer, and \(K\) is the number of classes. The weights and biases are optimized by minimizing the cross-entropy loss, which measures the discrepancy between the predicted and true probability distributions across the classes.
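The following short PyTorch snippet (illustrative, not our training code) shows the softmax layer and the cross-entropy loss; note that nn.CrossEntropyLoss applies the (log-)softmax internally, so the network itself can end with a plain linear layer.

import torch
import torch.nn as nn

K = 3                                               # number of classes
logits = torch.randn(8, K)                          # X^{(n_L)} for a batch of 8 examples
p = torch.softmax(logits, dim=1)                    # p_{(j)} as in the formula above
labels = torch.randint(0, K, (8,))
loss = nn.CrossEntropyLoss()(logits, labels)        # cross-entropy on the raw scores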

3.2 Caveats

Machine learning methods have demonstrated tremendous efficacy across a variety of tasks. However, several potential caveats may affect their performance and interpretation:

3.2.1 Choice of Model

In a scenario where your primary goal is salience analysis, the choice of a machine learning model is heavily influenced by interpretability, ability to reveal feature importance, and the potential to infer functional forms between inputs and outputs. Here are some considerations to help guide your model selection:

  1. Interpretability: When the aim is to understand what input features are important and their relationships with the output, interpretability becomes a crucial criterion for model selection. Models, such as linear regression, logistic regression, and decision trees, are traditionally more interpretable than, say, deep neural networks or support vector machines.

  2. Inherent Feature Importance: Some models inherently provide feature importance measures. For example, the size of coefficients in linear models can indicate feature importance, and tree-based models provide feature importance based on the frequency of a feature being used to split the data.

  3. Flexibility vs. Interpretability: More flexible models like neural networks can model complex relationships, but they often lack interpretability. Conversely, simpler models like linear regression provide clearer insight into relationships between variables but may fail to capture complex, non-linear relationships.

  4. Trade-Off Between Accuracy and Interpretability: There is often a trade-off between model accuracy and interpretability. In salience analysis, we might be willing to sacrifice some accuracy for better interpretability, which should be factored into the model selection process.

3.2.2 Training and Testing, Overfitting

One common pitfall in machine learning is overfitting, where the model learns the training data too well and performs poorly on unseen test data. In salience analysis scenarios, understanding the relationship between features and output is crucial. To ensure the model captures the actual underlying relationships and not the noise, controlling overfitting is essential. Here are some strategies that might help:

  1. Regularization: This technique penalizes the complexity of the model, discouraging learning overly complex patterns that might be due to noise. The common forms of regularization include \(\ell _1\)- and \(\ell _2\)-regularization.

  2. Early Stopping (for Neural Networks): During the training process, monitor the model’s performance on a validation set. Stop training as soon as the performance on the validation set begins to degrade.

  3. Dropout (for Neural Networks): Randomly “dropping out” units in a neural network during training can prevent complex co-adaptations on training data, which helps to avoid overfitting [58].

  4. Interpretable Models: If the main goal is to understand the relationships between features and output, using simpler, interpretable models such as linear regression or decision trees could be beneficial. These models may be less prone to overfitting compared to complex models like deep neural networks.

3.2.3 Randomness

When conducting salience analysis with machine learning models, it is crucial to understand and handle the randomness introduced by stochastic training processes. You want to ensure that the salience you identify is not due to the randomness in the training process, but truly indicative of the underlying relationships in your data. Below are a few strategies specifically for this context:

  1. Feature Importance Measures: Many models provide ways to measure the importance of features, either directly (like coefficients in linear models), or indirectly (like gradients in neural networks). However, remember that these measures are impacted by the stochasticity of training. To mitigate this, consider averaging feature importance over multiple runs or training multiple models using different random seeds and averaging their feature importance measures.

  2. Ablation Studies: One way to understand the importance of a feature is to see how much the model’s performance drops when that feature is removed. By conducting this analysis over multiple runs (with different random seeds), you can obtain a measure of feature importance that is robust to the randomness of the training process.

  3. Controlled Training Processes: Reduce the randomness in the training process through techniques such as decreasing the learning rate over time, using a larger batch size, or using a different optimization algorithm less sensitive to stochasticity, like RMSprop [53] or Adam [36].

3.2.4 Data Imbalance

Machine learning models can perform poorly when there is a class imbalance in the training data. In such cases, the model might be biased toward the majority class. Handling imbalanced data is crucial, especially in a context where understanding the relationships between features and output is of primary importance. We list some strategies specifically geared toward such scenarios:

  1. Resampling Techniques: One can alter the dataset itself to address the imbalance:

    • Oversampling the Minority Class: This involves creating or synthesizing new instances of the minority class until it reaches a similar number as the majority class. While it can balance the classes, it might lead to overfitting due to the replication of the minority instances.

    • Undersampling the Majority Class: This involves removing instances from the majority class until it reaches a similar number as the minority class. While it can be effective in balancing the dataset, it may cause loss of information by excluding potentially important instances from the majority class.

    Both approaches aim to balance the distribution between the majority and minority classes but come with potential drawbacks that must be carefully considered during implementation.

  2. Cost-Sensitive Learning: While this technique can always be used in cases of imbalanced data, the costs need to be chosen carefully in this context (see the sketch after this list). A simple heuristic like setting the cost inversely proportional to class frequency might not be the best choice, as it might lead to the model focusing too much on rare classes that might not have enough instances to derive reliable feature-output relationships. Another method is adjusting the temperature of the softmax function in the output layer of the model. The softmax function is often used in the final layer of a classification neural network to convert the outputs to probability values for each class. You can adjust the temperature based on the class frequencies, so that the model is made more sensitive to the minority class. However, keep in mind that it needs to be used carefully, as it can make the model more prone to overfitting to the minority class.
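As an illustration of the cost-sensitive option (one possible heuristic, with weights inversely proportional to class frequency; as noted above, such weights may need tuning), class weights can be passed directly to the loss function in PyTorch:

import torch

counts = torch.tensor([900.0, 100.0])               # e.g. 900 "empty" vs 100 "non-empty" examples
weights = counts.sum() / (len(counts) * counts)     # inverse-frequency weights
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)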

3.2.5 Dataset Size

Throughout this paper, we engage with datasets of different sizes depending on the complexity of the model under study—ranging from simple linear models to more elaborate two- or three-layer neural networks. When the goal is to extract a concise and clear formula, we prefer simpler models such as linear models, which have fewer parameters, thereby requiring less data. On the other hand, for feature analysis, a more accurate representation of the unknown function relationship is desired, without overfitting, which calls for more sophisticated models and larger datasets. In practice, the actual size of a dataset is typically determined through an empirical approach that involves training a model on datasets of different sizes and comparing the results. Upon reaching a point where improvement is only marginal, we consider the dataset size to be sufficient.

4 Programs

In this section, we introduce our program, which is composed of two primary modules. The first module computes variables of interest in affine Deligne–Lusztig varieties, as described in Sect. 4.1. The second module analyzes the data generated using machine learning techniques, as detailed in Sect. 4.2. Figure 2 illustrates the overall workflow of the program.

Fig. 2: Workflow of the program

4.1 Program for Affine Deligne–Lusztig Varieties

We give a short introduction to our Python program for affine Deligne–Lusztig varieties.

Choosing a group. The first input determines the type of algebraic group. The valid inputs are An, Bn, Cn, Dn, 2An, and 2Dn.

To simplify notation, we focus on the group \(G=\textrm{SL}_n\) (type \(A_{n-1}\)) throughout this section. However, we note that the program is compatible with all classical groups.

Input for w. The element \(w \in W_a\) can be expressed in the following two ways:

  • the product of the translation part and the finite part;

  • the product of a sequence of simple reflections.

An element of the form \(w = t^{(a_1,\ldots ,a_n)}s_{i_1} \cdots s_{i_r}\), where \((a_1,\ldots , a_n)\in X_*\) and all \(i_j \in \{1,2,\ldots ,n-1\}\), is written as \(\text {affine}\_\text {Weyl}([a_1,\ldots ,a_n],[i_1,\dots ,i_r])\). An element of the form \(w = s_{i_1} \cdots s_{i_r}\), where all \(i_j \in \{0,1,2,\ldots ,n-1\}\), is written as exp(\([i_1,\ldots ,i_r])\). For example, in the \(n=3\) case (type \(A_2\)), affine\(\_\)Weyl([1, 0, \(-1\)],[1, 2]) = exp([0, 2]).

The simple reflections are s[0], s[1],\(\ldots \), s[\(n-1\)]. The identity element of \(W_a\) is Id.

Input for b. In this program, we input the Newton point \(\nu _b\) of b instead of b. Note that in the case of \({\textbf{G}}= {\textrm{SL}}_n\), the conjugacy class [b] is determined by its Newton point.

Function (dim)

Computing dimension of ADLV:

$$\begin{aligned} \text {dim} (w,\nu ) \end{aligned}$$

Input: \(w\in {{\tilde{W}}}; \nu \in {\mathbb {Q}}^{n}\)

Output: \(\dim X_w(b) \)

Description: If \(X_w(b)=\emptyset \), the output is “empty”.

Function (irr)

Computing irreducible components of ADLV:

$$\begin{aligned} \text {irr}(w,\nu ) \end{aligned}$$

Input: \(w\in {{\tilde{W}}}; \nu \in {\mathbb {Q}}^{n}\)

Output: \(\sharp {\textbf{J}}_b(F) \backslash \Sigma ^\mathrm{{top}}X_w(b)\)

Description: If \(X_w(b)=\emptyset \), the output is 0.

Function (dim_irr_print)

Listing all b such that \(X_w(b)\ne \emptyset \) and computing dimension and number of irreducible components:

$$\begin{aligned} \text {dim}\_\text {irr}\_\text {print}(w) \end{aligned}$$

Input: \(w\in {{\tilde{W}}}\)

Output: Print the following:

Newton point = \(\nu \), dim = \(\text {dim}X_w(b)\), irr = \(\sharp {\textbf{J}}_b(F)\backslash \Sigma ^\mathrm{{top}}X_w(b)\)

Description: The function lists all b, such that \(X_w(b)\ne \emptyset \).

Example

A2 case.

Input:

\(\hbox {w} = \text {affine}\_\text {Weyl}([1,1,-2],[2,1])\); dim\(\_\)irr\(\_\)print(w)

Output:

\(\text {Newton point} = [1/2, 1/2, -1], \text {dim} = 1, \text {irr} = 1\)

\(\text {Newton point} = [0, 0, 0], \text {dim} = 3, \text {irr} = 1\)

Input:

print(dim(w,[0, 0, 0]), irr(w,[\(1/2,1/2,-1\)]), dim(w,[\(1,0,-1\)]), irr(w,[\(2,0,-2\)]))

Output:

3 1 empty 0

This demonstrates the three functions applied to the element \(w = t^{(1,1,-2)}s_2 s_1\). The first input demonstrates the behavior of the function dim_irr_print. The second input demonstrates the behavior of the functions dim and irr for various \([b]\in B(G)\) with \(X_w(b)\ne \emptyset \) and \(X_w(b)=\emptyset \). See Function (dim) and Function (irr) above for our convention for the dimension and the number of irreducible components of an empty set.

4.2 Program for Machine Learning

The program implemented in this module is primarily designed for generating datasets, training models, and performing analyses of the trained models. The input and output parameters, along with the intended usage of the four main functions, are outlined below:

Function (GenerateDataset)

Generate Dataset for Training:

$$\begin{aligned} \text {GenerateDataset }(\textrm{str}1, \textrm{str}2) \end{aligned}$$

Input: str1: filename for data; str2: filename for dataset

Description: The file generated in Sect. 4.1 is in the Numpy array format and named “str1”. For efficient subsequent operations, these data are converted into a PyTorch tensor and structured as a dataset named “str2”. The program automatically shuffles the data, allocating 80% as the training set and 20% as the testing set.
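A minimal sketch of the conversion and the 80/20 split performed by GenerateDataset is given below; the file names and the array layout (features in all but the last column, target in the last) are illustrative assumptions, not the exact format used by our program.

import numpy as np
import torch

data = np.load("data.npy")                          # hypothetical file: rows = [features..., target]
X = torch.tensor(data[:, :-1], dtype=torch.float32)
Y = torch.tensor(data[:, -1], dtype=torch.float32)

perm = torch.randperm(len(X))                       # shuffle
n_tr = int(0.8 * len(X))                            # 80% training, 20% testing
train = (X[perm[:n_tr]], Y[perm[:n_tr]])
test = (X[perm[n_tr:]], Y[perm[n_tr:]])
torch.save({"train": train, "test": test}, "dataset.pt")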

Function (LinearReg)

Linear Regression:

$$\begin{aligned} \text {LinearReg}({\textbf{X}},{\textbf{Y}},\lambda ) \end{aligned}$$

Input: \({\textbf{X}}\in {\mathbb {R}}^{N\times c}, {\textbf{Y}}\in {\mathbb {R}}^{N}, \lambda \in {\mathbb {R}}\)

Output: \(\beta \in {\mathbb {R}}^c, b\in {\mathbb {R}}\)

Description: This function is used to solve a linear regression problem as outlined in Sect. 3.1.1. The hyperparameter \(\lambda \), as discussed in Sect. 2.4, regulates the magnitude of the regularization term. The ith row of the matrix \({\textbf{X}}\) represents \(X_i\), and the ith element of the vector \({\textbf{Y}}\) represents \(Y_i\). The same convention applies throughout the following discussion.

Function (LinearCls)

SVM:

$$\begin{aligned} \text {LinearCls}({\textbf{X}},{\textbf{Y}},\lambda ) \end{aligned}$$

Input: \({\textbf{X}}\in {\mathbb {R}}^{N\times c}, {\textbf{Y}}\in {\mathbb {R}}^{N}, \lambda \in {\mathbb {R}}\)

Output: \(\beta \in {\mathbb {R}}^c, b\in {\mathbb {R}}\)

Description: The function is used to solve SVM problem in Sect. 3.1.2, with the hyperparameter of the regularization term set to \(\lambda \).

Function (NetReg)

Neural Network Regression:

$$\begin{aligned} \text {NetReg}({\textbf{X}},{\textbf{Y}},n_L,n_H,\lambda ) \end{aligned}$$

Input: \({\textbf{X}}\in {\mathbb {R}}^{N\times c}; {\textbf{Y}}\in {\mathbb {R}}^{N}; n_L\in {\mathbb {N}}^+; n_H\in {\mathbb {N}}^+; \lambda \in {\mathbb {R}}\)

Output: \({\hat{f}}\)

Description: The output \({\hat{f}}\) is a trained fully connected neural network encapsulated within an instance of “torch.nn.Module”, a PyTorch class designed for layer management and network definition. It is to be noted that, unless explicitly stated, all neural networks referenced in this program conform to this standard representation. The network comprises \(n_L\) layers, each with \(n_H\) neurons, as described in Sect. 3.1.3. Weight decay is used as the regularization term, with the hyperparameter \(\lambda \) controlling its strength.

Function (NetCls)

Neural Network Classification:

$$\begin{aligned} \text {NetCls}({\textbf{X}},{\textbf{Y}},n_L,n_H,\lambda ) \end{aligned}$$

Input: \({\textbf{X}}\in {\mathbb {R}}^{N\times c}; {\textbf{Y}}\in {\mathbb {R}}^{N}; n_L\in {\mathbb {N}}^+; n_H\in {\mathbb {N}}^+; \lambda \in {\mathbb {R}}\)

Output: \({\hat{f}}\)

Description: The symbol \({\hat{f}}\) denotes a trained fully connected network designed specifically for classification problems. The network comprises \(n_L\) layers, each with \(n_H\) neurons, as described in Sect. 3.1.4. Weight decay is used as the regularization term, with the hyperparameter \(\lambda \) controlling its strength.

Function (NetGrad)

Sensitivity Analysis:

$$\begin{aligned} \text {NetGrad}({\textbf{X}},{\textbf{Y}},{\hat{f}}) \end{aligned}$$

Input: \({\textbf{X}}\in {\mathbb {R}}^{N\times c}; {\textbf{Y}}\in {\mathbb {R}}^{N}; {\hat{f}}\)

Output: \(g\in {\mathbb {R}}^c\)

Description: This function is utilized to quantify the sensitivity of individual features in the trained network \({\hat{f}}\). The sensitivity is gauged by evaluating the mean of the absolute values of the derivatives of the loss function with respect to each feature, taken over the dataset \(\{{\textbf{X}},{\textbf{Y}}\}\). Specifically, for the jth feature, the sensitivity is calculated as

$$\begin{aligned} g_{(j)}=\frac{1}{N} \sum _{i=1}^N \left| \frac{\partial {\mathcal {L}}({\hat{f}}(X_i),Y_i)}{\partial X_{i,(j)}}\right| , \end{aligned}$$

where \(X_{i,(j)}\) denotes the jth feature of the input example \(X_i\), N is the total number of examples, and \({\mathcal {L}}\) represents the loss function. The term \(g_{(j)}\) delivers an aggregate measure of how sensitive the loss function is to variations in the jth feature, thus quantifying the importance of that feature in the learned representation captured by \({\hat{f}}\).
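A minimal PyTorch sketch of this sensitivity measure is given below (illustrative, not the exact implementation of NetGrad); a regression loss with sum reduction is assumed, so that the averaged absolute input gradients match the formula above.

import torch

def feature_sensitivity(f_hat, X, Y):
    """Mean absolute input gradient per feature, as in the formula above."""
    X = X.clone().requires_grad_(True)              # track gradients with respect to the inputs
    total_loss = torch.nn.MSELoss(reduction="sum")(f_hat(X).squeeze(-1), Y)
    grads, = torch.autograd.grad(total_loss, X)     # row i holds dL(f(X_i), Y_i)/dX_i
    return grads.abs().mean(dim=0)                  # g_(j), one value per feature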

5 Searching for a Dimension Formula

5.1 Virtual Dimension

The journey toward the dimension formula of affine Deligne–Lusztig varieties has a long history. For the affine Grassmannian case, Rapoport proposed the dimension formula in [51], drawing inspiration from Chai’s earlier work [10] on the length function of chains of \(\sigma \)-conjugacy classes. This conjecture was ultimately validated by a series of researchers, primarily [16, 20, 60, 70].

In this paper, our attention is on the affine flag case, a significantly more challenging problem. Görtz, Haines, Kottwitz, and Reuman proposed a conjectural formula for \(\dim X_w(b)\) for most pairs (w, b) in [17]. This was partly inspired by the aforementioned dimension formula for the affine Grassmannian case. This conjecture was verified by He in [25] and [28]. However, our understanding of the remaining cases, which include many crucial applications to number theory and the Langlands program, remains limited. We aim to broaden this understanding using machine learning.

We revisit the concept of virtual dimension introduced by He in [25]. This was inspired by the conjecture of Görtz, Haines, Kottwitz, and Reuman. Let \(w \in W_a\) and express it as \(w = x t^\mu y\) as in Sect. 2.2. Define \(\eta (w) = yx\) and

$$\begin{aligned} d_w(b) = \frac{1}{2}\bigl (\ell (w) + \ell (\eta (w))\bigr )-\langle \nu _b,\rho \rangle - \frac{1}{2}\text {def}(b). \end{aligned}$$
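For illustration, the virtual dimension is straightforward to evaluate once \(\ell (w)\), \(\ell (\eta (w))\), \(\nu _b\), and \({\textrm{def}}(b)\) are known (for instance via the sketches in Sect. 2.2); the following short function is our own illustration, not part of the program of Sect. 4.

from fractions import Fraction

def virtual_dimension(ell_w, ell_eta, nu_b, def_b):
    """d_w(b) = (ell(w) + ell(eta(w)))/2 - <nu_b, rho> - def(b)/2."""
    n = len(nu_b)
    rho = [Fraction(n - 1 - 2 * i, 2) for i in range(n)]   # rho = half the sum of positive roots
    pairing = sum(Fraction(x) * r for x, r in zip(nu_b, rho))
    return Fraction(ell_w + ell_eta, 2) - pairing - Fraction(def_b, 2)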

He demonstrated in [25] and [28] the following result.

Theorem 1

Suppose \(X_w(b)\ne \emptyset \). Then

  1. \(\dim X_w(b)\leqslant d_w(b)\).

  2. If w is 2-regular and \(\mu -\nu _b\) is “sufficiently large”, then \(\dim X_w(b)=d_w(b)\).

The virtual dimension formula is a practical approximation of the real dimension. The discovery of this formula took experts considerable time, stretching from Rapoport’s lectures in 1996 [51] to the introduction of the virtual dimension formula in the most general setting by He in 2012 [25]. This section aims to illustrate how machine learning could have helped us rediscover this formula, potentially accelerating the research process.

The dimension of affine Deligne–Lusztig varieties depends on two parameters, w and b. We note that the b-part of the virtual dimension formula is relatively simpler to uncover than the w-part, and we omit the details of the learning process for now. In this section, we concentrate on the group \({\textrm{SL}}_5\), the case \(b=1\), and randomly generated elements w. We investigate how machine learning can shed light on the correlation between w and the dimension \(\dim X_w(1)\). Our method does not rely on prior knowledge of the dimension formula in the Grassmannian case or the virtual dimension formula.

The selection of appropriate input features is crucial for machine learning. We work with the affine Weyl group of type \({{\tilde{A}}}_4\), where each element \(w=t^\lambda u \in W_a\) is made up of the translation part \(\lambda \) and the finite part u. We include both parts as input features: \(\lambda \) as a vector, and u as a permutation denoted by \(u=[u_1, u_2, u_3, u_4, u_5]\).

For classical Deligne–Lusztig varieties \(X_w\) [15], it is known that \(\dim X_w=\ell (w)\). Therefore, we anticipate that the dimension of affine Deligne–Lusztig varieties \(X_w(1)\) is also related to \(\ell (w)\). Consequently, we include the length function for both w and u as input features.

5.2 Complexity Test

To evaluate the complexity of the mapping, we utilize neural networks. Specifically, we consider an \(n_L\)-layer fully connected network with ReLU activation and \(n_H\) hidden neurons. This network can be expressed as

$$\begin{aligned} {\hat{f}}_\theta (X)=\beta ^\top \rho (W_{n_L}\cdots W_2\rho (W_1X)\cdots ), \quad \theta =\{W_1,W_2,\ldots ,W_{n_L},\beta \}, \end{aligned}$$

where \({\hat{f}}_{\theta }(X)\) is the predicted output of the neural network for input \(X\in {\mathbb {R}}^{c}\), and \(\theta \) is the set of all trainable parameters, including weights

$$\begin{aligned} W_1\in {\mathbb {R}}^{n_H\times c}, W_2\in {\mathbb {R}}^{n_H\times n_H},\ldots , W_{n_L}\in {\mathbb {R}}^{n_H\times n_H},\beta \in {\mathbb {R}}^{n_H}. \end{aligned}$$

We use the Rectified Linear Unit (ReLU) activation function, defined as \(\rho (x)=\max (0,x)\), for each hidden layer.
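The architecture above can be sketched in a few lines of PyTorch; this is only an illustration of the bias-free fully connected ReLU network described in the text (the actual training setup and initialization may differ). Weight decay with strength \(\lambda \) would typically be passed to the optimizer, e.g., `torch.optim.Adam(net.parameters(), weight_decay=lam)`.

```python
import torch.nn as nn


def build_network(c: int, n_L: int, n_H: int) -> nn.Module:
    """f_theta(X) = beta^T rho(W_{n_L} ... W_2 rho(W_1 X) ...) with ReLU activations."""
    layers = [nn.Linear(c, n_H, bias=False), nn.ReLU()]        # W_1 and rho
    for _ in range(n_L - 1):                                   # W_2, ..., W_{n_L}
        layers += [nn.Linear(n_H, n_H, bias=False), nn.ReLU()]
    layers.append(nn.Linear(n_H, 1, bias=False))               # the output vector beta
    return nn.Sequential(*layers)
```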

To quantify the accuracy of the optimized \({\hat{f}}_{\theta ^*}\) in approximating f after training, we use two metrics: accuracy and mean error. Since the dimension is always an integer, we round the inferred results \({\hat{f}}_{\theta ^*}(X_i)\) to the nearest integer for accuracy. Specifically, we define accuracy as

$$\begin{aligned} \text {Accuracy}=\frac{1}{N}\sum _{i=1}^N \delta (Y_i,\mathrm{{round}}({\hat{f}}_{\theta ^*}(X_i))), \end{aligned}$$

where N is the number of samples, and \(\delta (\cdot )\) is the indicator function that outputs 1 if the arguments are equal and 0 otherwise. The mean error is defined as

$$\begin{aligned} \text {Mean Error} = \frac{1}{N}\sum _{i=1}^N |Y_i-{\hat{f}}_{\theta ^*}(X_i)|. \end{aligned}$$
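A small sketch of the two metrics (our helper `evaluate`, assuming a trained PyTorch regression network and tensors X, Y as above):

```python
import torch


def evaluate(model: torch.nn.Module, X: torch.Tensor, Y: torch.Tensor):
    """Accuracy after rounding to the nearest integer, and mean absolute error."""
    with torch.no_grad():
        pred = model(X).squeeze(-1)
    accuracy = (pred.round() == Y).float().mean().item()
    mean_error = (Y - pred).abs().mean().item()
    return accuracy, mean_error
```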

In the tables below, the bold values highlight the terms with relatively large absolute values, which indicate possible significance suggested by the experiments.

Dataset 1. We describe the first dataset used in our experiments. We randomly choose 5000 elements w from the set \(W_a\) such that \(\ell (w) < 30\) and \(X_w(1) \ne \emptyset \). For each \(w = t^{\lambda }u\), we express \(\lambda = [\lambda _1,\lambda _2,\lambda _3,\lambda _4,\lambda _5]\) and \(u = [u_1,u_2,u_3,u_4,u_5]\) as a permutation. We then compute the dimension \(\dim X_w(1)\) for all these w.

Experiment 1. In this experiment, we use Dataset 1 to train a neural network to predict the dimension of \(X_w(1)\) for each \(w=t^{\lambda }u\), where \(\lambda \) and u are defined as above. Specifically, the input vector for each w is defined as \(X = [\lambda _1,\lambda _2,\lambda _3,\lambda _4,\lambda _5, u_1,u_2,u_3,u_4,u_5,\ell (u),\ell (w)]\), and the corresponding output is \(Y = \dim X_w(1)\). We obtain a dataset of 5000 samples \(\{X_i,Y_i\}_{i=1}^{5000}\) and train a neural network on it. The mean error was computed on the testing set to assess the neural network’s prediction accuracy on unseen data, reflecting the average discrepancy between the predicted \({\hat{f}}_{\theta ^*}(X_i)\) and true values \(Y_i\). The experimental results are presented in Table 1.

Table 1 Testing error of different neural networks for Dataset 1

Analysis. The results of Experiment 1 show that the error obtained is relatively small. Furthermore, we find that increasing the number of layers \(n_L\) and hidden units \(n_H\) in the network does not significantly enhance the accuracy of the neural network in predicting the dimension. These observations lead us to hypothesize that a linear model may be sufficient to approximate the mapping f from the input data to the output dimension. This hypothesis is investigated in the following subsection by examining the performance of a linear model.

5.3 Linear Model

A linear function without a bias term can be expressed as \({\hat{f}}_\theta (X)=\beta ^\top X =\sum _{i=1}^c \beta _{[i]}X_{[i]}\), where \(\beta _{[i]}\) denotes the ith element of the vector \(\beta \), which represents the coefficient of the ith term. For linear models, the mean error is used to evaluate the approximation of \({\hat{f}}_{\theta ^*}\) to f.
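For illustration, an unregularized linear model of this form can be fitted by ordinary least squares; the sketch below (our helper names) is not necessarily the training procedure used in the experiments.

```python
import numpy as np


def fit_linear_model(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Least-squares fit of f(X) = beta^T X without a bias term; returns beta."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta


def mean_error(beta: np.ndarray, X: np.ndarray, Y: np.ndarray) -> float:
    """Mean absolute error of the fitted linear model on the set {X, Y}."""
    return float(np.mean(np.abs(Y - X @ beta)))
```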

Experiment 2. We use Dataset 1 to train a linear model to predict the dimension of \(X_w(1)\) for each \(w=t^{\lambda }u\), where \(\lambda \) and u are as previously described. The mean error of the linear model is 0.65.

Analysis. Recall that the presentation \(w = xt^\mu y\) is crucial for understanding the properties of the affine Weyl group element w, where \(x,y\in W_0\) and \(\mu \in X_*(T)\) is dominant. However, this presentation is not unique. Imposing a subtle restriction on x or y yields two uniquely determined presentations, but these may be different and thus are not canonical. Alternatively, one may restrict to those elements \(w = xt^\mu y\) where \(\mu \) is dominant regular, i.e., \(\mu _1> \mu _2> \mu _3> \mu _4 > \mu _5\). For such w, there are uniquely determined \(x,y\in W_0\) with \(w = xt^\mu y\).

We consider a subset of Dataset 1 consisting of those w where \(\mu \) is dominant regular, hypothesizing that the chosen features are more meaningful on this subset. This subset consists of 1037 entries. A newly trained linear model on this subset attains a mean test error of 0.62, which confirms our expectation. Hence, we consider analyzing the linear model under the dominant regular constraints by resampling 5000 points to form Dataset 2.

Dataset 2. Note that any element with a regular translation part can be written as \(w =x t^{\mu }y\) where \(x,y\in W_0\) and \(\mu = [\mu _1, \mu _2, \mu _3, \mu _4, \mu _5]\) is dominant regular, i.e., \(\mu _1>\mu _2>\mu _3>\mu _4>\mu _5\). To investigate the neural network’s performance on regular translation parts, we randomly choose 5000 elements from the set of all such elements, where \(\mu \) is dominant regular with \(7>\mu _1>\mu _2>\mu _3>\mu _4>\mu _5>-7\), and \(X_w(1)\ne \emptyset \). We compute \(\dim X_w(1)\) for all these w.

Experiment 3. We use Dataset 2. For each \(w=xt^{\mu }y\), set

$$\begin{aligned} X&= [\delta (x(\alpha _{12})), \delta (x(\alpha _{13})),\ldots ,\delta (x(\alpha _{45})), \ell (x),\\&\qquad \mu _1,\mu _2,\mu _3,\mu _4,\mu _5, \delta (y^{-1}(\alpha _{12})) ,\ldots ,\delta (y^{-1}(\alpha _{45})), \ell (y), \ell (w)] \end{aligned}$$

and

$$\begin{aligned} Y=\dim X_w(1). \end{aligned}$$

The input features are explained in detail as follows. We write \(\alpha _{ij} = e_i - e_j\) for the root and

$$\begin{aligned} \delta (\alpha _{ij}) = {\left\{ \begin{array}{ll} 1,&{}\quad i>j,\\ 0,&{}\quad i<j, \end{array}\right. } \end{aligned}$$

for the indicator function of the negative roots. A linear combination of the permutation values \(x(1),\ldots ,x(n)\) would be very hard to interpret mathematically, which is why we use the \(\delta \)-values.

It is important to note that, for the neural network, the difference between presenting a permutation as \(\delta (x(\alpha _{ij}))\) or \(\delta (x^{-1}(\alpha _{ij}))\) is significant. While these two presentations are mathematically equivalent, transitioning from one to the other is a non-linear procedure [66, Sect. 5.2]. We use x and \(y^{-1}\), which have more direct mathematical interpretations than their inverses \(x^{-1}\) and y. Specifically, \(y^{-1}\) indicates the Weyl chamber of w, whereas x typically indicates the Weyl chamber of the inverse \(w^{-1}\).
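Under our reading of these conventions, the \(\delta \)-features of a permutation are easy to generate: \(\delta (x(\alpha _{ij})) = 1\) exactly when x maps the positive root \(\alpha _{ij} = e_i - e_j\) to a negative root, i.e., when \(x(i) > x(j)\). The helper below (hypothetical names) also recovers \(\ell (x)\) as the number of inversions; the features for \(y^{-1}\) are obtained by applying it to the inverse permutation.

```python
from itertools import combinations


def delta_features(x):
    """delta(x(alpha_ij)) for all i < j: equals 1 iff x(i) > x(j).
    x is a permutation in one-line notation with values 1..n."""
    return [1 if x[i] > x[j] else 0 for i, j in combinations(range(len(x)), 2)]


def coxeter_length(x):
    """Length of x in the finite Weyl group = number of inversions = sum of deltas."""
    return sum(delta_features(x))


# e.g. x = [2, 4, 1, 5, 3] in S_5 gives delta_features(x) = [0,1,0,0,1,0,1,0,0,1]
# and coxeter_length(x) = 4.
```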

We obtain \(\{X_i,Y_i\}_{i=1}^{5000}\). Upon applying the linear model to this dataset, the results of this experiment are summarized in Table 2. The average error is found to be 0.65.

Table 2 Coefficient of Experiment 3
Table 3 Coefficient of Experiment 4

Analysis. From the above table, we observe that the length of w is the most significant feature, with a coefficient approximately equal to 1/2. We thus subtract this potential leading term from the dimension for our subsequent experiments. Specifically, the output Y will be \(\dim X_w(1) - \frac{1}{2}\ell (w)\). It is noteworthy that x and y belong to the finite Weyl group, and their contribution to the dimension should be limited. Conversely, the range of \(\mu \) is unbounded and it is anticipated that \(\mu \) could contribute to the potential leading term of the linear approximation of the dimension. We therefore hypothesize that the contribution of \(\mu \) in the linear model is already encapsulated in the term \(\frac{1}{2}\ell (w)\) (i.e., the contribution of \(\mu \) is given by \(\langle \mu , \rho \rangle \)). Consequently, we will eliminate \(\mu _i\) from X.

Experiment 4. The analysis of the dimension of affine Deligne–Lusztig varieties, even for smaller rank groups such as \({\textrm{SL}}_3\), suggests that different Weyl chambers may exhibit different patterns (cf. [16, Sect. 7] and [3]). In [23], He introduced the technique of partial conjugation, which, to some extent, reduces the problem to the dominant chamber. Therefore, we primarily focus on elements in the dominant Weyl chamber, i.e., elements of the form \(w=t^{\mu }y\) where \(\mu \) is dominant regular.

Dataset 3. This dataset is defined similarly to Dataset 2. We randomly choose 5000 elements \(w =t^{\mu }y\) where \(y\in W_0\) and \(\mu = (\mu _1,\mu _2,\mu _3,\mu _4,\mu _5)\) is dominant regular with \(9>\mu _1>\mu _2>\mu _3>\mu _4>\mu _5> - 9\), and \(X_w(1)\ne \emptyset \).

For each \(w = t^{\mu }y\), we define

$$\begin{aligned} X = [ \delta (y^{-1}(\alpha _{12})),\ldots , \delta (y^{-1}(\alpha _{45})),\ell (y) ] \end{aligned}$$

and

$$\begin{aligned} Y=\dim X_w(1) - \frac{1}{2}\ell (w). \end{aligned}$$

We derive \(\{X_i,Y_i\}_{i=1}^{5000}\) and apply a linear model. We obtain \({\hat{f}}_{\theta ^*}\) with an average error of 0.30. The coefficients are listed in Table 3.

Recall that \(\ell (y) = \sum _{i<j}\delta (y^{-1}(\alpha _{ij}))\). If we take this linear dependence into account, the effective coefficient of \(\ell (y)\) in the above linear model is

$$\begin{aligned}&0.42+(0.02\!+\!0.05\!+\!0.10\!+\!0.13-0.05+0.07+0.11-0.07\!+\!0.05\!+\!0.00)\cdot \frac{1}{10}\\&\quad =0.46. \end{aligned}$$

Analysis. We could conjecture that \(\frac{1}{2}\ell (y)\) is the leading term, with an “error term” that is one order of magnitude smaller. Before progressing with our experiments, we introduce some general strategies and terminology for the training and interpretation of linear models.

In the context of linear regression, the inclusion of a regularization term is essential when dealing with highly linearly correlated features. This situation can cause instability and unreliable estimates of the regression coefficients. Regularization involves adding a penalty term to the loss function, generally based on the magnitudes of the regression coefficients.

Practically, the \(\ell _2\)-norm is a commonly selected regularization term. The loss function can be represented as

$$\begin{aligned} \min \sum _{i=1}^{N} (Y_i-\beta ^\top X_i)^2+\lambda \Vert \beta \Vert _2^2. \end{aligned}$$

The \(\ell _2\)-regularization in linear regression results in a unique optimal solution, providing model stability and avoiding multiple solutions. The \(\ell _2\)-regularization penalizes larger regression coefficients proportionally, leading to more stable models with reduced overfitting and improved generalization.

For instance, if we use only one feature, \(\ell (y)\), to train the linear model, no regularization term is needed, and we obtain a model with an average error of 0.30 and a coefficient of 0.47. This coefficient is much closer to 0.5; the smaller value observed earlier stems from the \(\ell _2\)-regularization’s preference for small, evenly spread coefficients when features are linearly dependent. Therefore, a common strategy is to first identify the most important features, discard less important features, and then retrain the model’s parameters [4, 22]. This process allows for obtaining more accurate results and helps in mitigating the effects of highly correlated features.
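As an illustration, ridge regression with this objective has the closed-form solution \(\beta = (X^\top X + \lambda I)^{-1}X^\top Y\); the sketch below (our helper, with X the \(N\times c\) design matrix) is one standard way to compute it, and library implementations such as scikit-learn's Ridge behave similarly.

```python
import numpy as np


def ridge_fit(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Minimize sum_i (Y_i - beta^T X_i)^2 + lam * ||beta||_2^2 via the normal equations."""
    c = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(c), X.T @ Y)
```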

Additionally, the \(\ell _1\)-norm is also a common metric used in regularization terms and fidelity terms. It can be expressed in the following form:

$$\begin{aligned} \min \sum _{i=1}^{N} |Y_i-\beta ^\top X_i|+\lambda \Vert \beta \Vert _1. \end{aligned}$$

When used as a fidelity term, the \(\ell _1\)-norm penalizes absolute differences between the model predictions and the true targets. This makes it more robust to outliers compared to the \(\ell _2\) norm, which is more sensitive to large errors.

Also, as a regularization term, the \(\ell _1\)-norm induces sparsity in the model parameters, driving many parameters close to zero. This performs automatic feature selection, removing uninformative features and improving interpretability. In contrast, the \(\ell _2\)-norm does not induce sparsity, but instead diffuses weight across all parameters.

Applying the \(\ell _1\) model to the data of Experiment 4, we obtain \({\hat{f}}_{\theta ^*}\) with an average error of 0.18. The coefficients are listed in Table 4.

Table 4 Coefficient of \(\ell _1\) model for Experiment 4
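For the sparsity-inducing effect of the \(\ell _1\) penalty, a convenient off-the-shelf choice is scikit-learn's Lasso; note that it pairs a squared-error fidelity term with the \(\ell _1\) penalty, whereas the display above also uses an \(\ell _1\) fidelity term (which would instead be solved, e.g., by linear programming). The sketch below is therefore only one possible instantiation of the \(\ell _1\) model.

```python
import numpy as np
from sklearn.linear_model import Lasso


def l1_fit(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """l1-penalized linear fit; many coefficients are driven exactly to zero."""
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    model.fit(X, Y)
    return model.coef_
```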

Experiment 5. We aim to investigate how the findings from the previous sections extend to other Weyl chambers. Given \(w = xt^{\mu }y\), our goal is to estimate

$$\begin{aligned} Y=\dim X_w(1)-\frac{1}{2}\ell (w). \end{aligned}$$

If \(x=1\), we understand that the leading term should be \(\frac{1}{2} \ell (y)\). Generally speaking, there are a number of intuitive ways to combine x and y, particularly in light of the previously mentioned partial conjugation method. These include the mutual products xy and yx, as well as the Demazure products \(y*x\) and \(y \triangleleft x\) [24]. Considering that the \(\delta \)-values only slightly contribute to the linear model in the dominant chamber, we exclude them in this experiment.

We utilize Dataset 2 (arbitrary Weyl chamber). For each \(w = xt^{\mu }y\), we set

$$\begin{aligned} X = [\ell (x), \ell (y),\ell (xy),\ell (yx),\ell (y*x),\ell (y \triangleleft x)]. \end{aligned}$$

This gives us the pair \((X_i,Y_i)\) for \(i=1\) to 5000. Upon applying a linear model, we obtain \({\hat{f}}_{\theta ^*}\) with a mean error of 0.13. The coefficients are provided in Table 5.

Table 5 Coefficient of Experiment 5

Analysis. It is evident that \(\ell (yx)\) is the most influential feature, yet interpreting the coefficient 0.46 poses a mathematical challenge. Note that while the input features are linearly independent, numerous mathematical relationships exist between them, as shown by the inequality

$$\begin{aligned} \left|\ell (x)-\ell (y)\right|\leqslant \ell (y\triangleleft x) \leqslant \ell (yx)\leqslant \ell (y*x). \end{aligned}$$

We can speculate that \(\frac{1}{2}\ell (y x)\) could be a suitable candidate for the leading term, with the remaining terms being relatively small. This speculation leads us to the linear model

$$\begin{aligned} \dim X_w(1)\approx \frac{1}{2}\left( \ell (w) + \ell (yx)\right) , \end{aligned}$$

which aligns with the previously mentioned virtual dimension \(d_w(1)\).
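As a sanity check, the candidate formula is cheap to evaluate from the data already at hand; a small sketch (our helper names, with \(\ell (w)\) taken as given from the data pipeline and x, y in one-line notation):

```python
def inversions(p):
    """Coxeter length of a permutation p (one-line notation, values 1..n)."""
    return sum(1 for i in range(len(p)) for j in range(i + 1, len(p)) if p[i] > p[j])


def compose(u, v):
    """(u v)(k) = u(v(k)) for permutations in one-line notation."""
    return [u[v[k] - 1] for k in range(len(v))]


def virtual_dimension_b1(ell_w: int, x, y) -> float:
    """Candidate d_w(1) = (ell(w) + ell(yx)) / 2 for w = x t^mu y and b = 1."""
    return 0.5 * (ell_w + inversions(compose(y, x)))
```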

Experiment 6. We now scrutinize this linear model, which we have identified as representing the virtual dimension. For all pairs (w, b) such that \(w\in W_a\) satisfies \(\ell (w)<30\) and \([b]\in B({\textrm{SL}}_5)\) satisfies \(X_w(b)\ne \emptyset \), we examine the difference \(d_w(b)-\dim X_w(b)\) between the virtual dimension and the actual dimension of \(X_w(b)\). The number of such pairs is 3,119,946. The results are presented in Table 6.

Table 6 Number of pairs (w, b) with certain value of virt. dim. minus dim.

Analysis. We observe that the difference \(d_w(b) - \dim X_w(b)\) is always non-negative, and equals zero in the majority of cases. Both of these observations are well documented, and represent major accomplishments in the field [25, 28]. Furthermore, we notice that this difference also seems to be upper-bounded by 4, a surprisingly small value considering the number of pairs (w, b) and the large dimensions involved. This leads us to conjecture that \(d_w(b)-\dim X_w(b)\) always has a reasonably small upper bound independent of (w, b). We will explore that question further in Sect. 7.

6 Searching for Important Features

In this section, we revisit the group \({\textrm{SL}}_5\) without imposing any restrictions on \([b]\in B({\textrm{SL}}_5)\) or \(w\in W_a\).

6.1 Detailed Introduction to SVM Method

To ensure self-containment and enhance understanding of the experimental outcomes, we provide a detailed introduction to the Support Vector Machine (SVM) model. This widely used machine learning algorithm is primarily employed for classification tasks and will be utilized in our experiments on the non-emptiness pattern and the condition of dimension equalling virtual dimension.

The primary objective of SVM is to identify an optimal hyperplane that effectively separates data points into different classes. In the case of binary classification, the hyperplane is chosen to maximize the margin, which refers to the distance between the hyperplane and the nearest data points from each class. In this context, we primarily focus on linear SVM for the sake of result interpretability. The equation of this hyperplane is given by

$$\begin{aligned} {\hat{f}}(X)=\beta ^\top X - b = 0. \end{aligned}$$

For instance, in the non-emptiness pattern experiments, SVM aims to establish a hyperplane that bifurcates the dataset into two regions. Specifically, on one side of the hyperplane, all instances satisfy \(X_{w}(1) \ne \emptyset \); on the other side, all instances satisfy \(X_{w}(1) = \emptyset \). This is depicted in Fig. 3.

Fig. 3

A demo for SVM

When data are not inherently linearly separable, SVM often employs techniques to find an “optimal” hyperplane. A commonly used evaluation criterion is the hinge loss, a margin-based loss function that penalizes misclassifications and encourages SVM to identify a decision boundary with a larger margin. For a binary classification problem, the hinge loss for a single data point is defined as

$$\begin{aligned} {\mathcal {L}}_{\text {hinge}}(Y, {\hat{f}}(X)) = \max (0, 1 - Y \cdot {\hat{f}}(X)). \end{aligned}$$

To optimize the SVM model, the aim is to minimize the sum of hinge losses across all training data points while incorporating a regularization term. This term helps prevent overfitting and controls the complexity of the learned model, often represented by the \(\ell _2\)-norm of the weight vector \(\beta \). The SVM optimization problem can be formulated as

$$\begin{aligned} \min \sum _i {\mathcal {L}}_{\text {hinge}}(Y_i, {\hat{f}}(X_i))+\lambda \Vert \beta \Vert _2^2. \end{aligned}$$

Solving this optimization problem allows SVM to learn a decision boundary that generalizes well to unseen data, leading to accurate classification or regression predictions.

During the inference stage, an input X is classified as the first class if \({\hat{f}}(X)\) is greater than 0, and classified as the second class otherwise. The coefficients represented by \(\beta \) provide insightful information about the relationship between the input features and the class labels. Specifically, a positive \(\beta _{(i)}\) suggests that as the value of \(X_{(i)}\) increases, the likelihood of belonging to the first class also increases. Furthermore, a larger absolute value of \(\beta _{(i)}\) signifies a stronger influence of the corresponding feature on the classification decision. Conversely, a negative \(\beta _{(i)}\) indicates a negative relationship between the feature and the likelihood of belonging to the first class.
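A minimal sketch of such a linear SVM using scikit-learn (our wrapper; the actual experimental setup, regularization strength, and preprocessing may differ). Note that scikit-learn's C corresponds roughly to \(1/\lambda \) in the objective above, and its decision function is \(\beta ^\top X + \text {intercept}\) rather than \(\beta ^\top X - b\).

```python
import numpy as np
from sklearn.svm import LinearSVC


def fit_linear_svm(X: np.ndarray, Y: np.ndarray, lam: float = 1.0):
    """Linear SVM with hinge loss and l2 penalty; returns (beta, intercept, accuracy)."""
    clf = LinearSVC(loss="hinge", C=1.0 / lam, dual=True, max_iter=10000)
    clf.fit(X, Y)                    # X: (N, c) features, Y in {+1, -1}
    return clf.coef_.ravel(), clf.intercept_[0], clf.score(X, Y)
```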

6.2 Experiments on the Non-emptiness Pattern

Data: All \(w=xt^{\mu }y\in W_a\) and \([b]\in B({\textrm{SL}}_5)\) respecting the conditions \(\ell (w)<30\) and \(\nu _b\leqslant \mu \). The latter condition, known as Mazur’s inequality, is a necessary condition for \(X_w(b)\ne \emptyset \). (Dataset size: 8,705,879)

Feature:

$$\begin{aligned} X = [ x_{ij},\mu _1,\dots ,\mu _5, y^{-1}_{ij},\eta _{ij}=\delta (\eta (w)(\alpha _{ij})), \ell (w), \nu _1,\dots ,\nu _5,\lambda _1,\dots , \lambda _5]\in {\mathbb {R}}^{46}. \end{aligned}$$

Output: \(Y = {\left\{ \begin{array}{ll} 1&{} \quad \text {if }X_w(b)\ne \emptyset \\ -1&{} \quad \text {if }X_w(b)= \emptyset \\ \end{array}\right. }\)

Result: Three models were utilized in our experiments, each executed 100 times. The first model, SVM, yielded an average accuracy of 78.34%, with the average coefficients reported in Table 7. The second model, a single-layer neural ReLU classification network with 10 neurons, achieved an average training accuracy of 87.14% and an average test accuracy of 87.17%, with the average gradients documented in Table 8. The third model, a three-layer neural ReLU classification network with 20 neurons per layer, had an average training accuracy of 94.72% and an average test accuracy of 94.74%, with the average gradients detailed in Table 9.

Table 7 Average coefficient of SVM for non-empty
Table 8 Average gradient of one-layer NN for non-empty
Table 9 Average gradient of three-layer NN for non-empty

Analysis. It is evident that the “hook”-part of \(\eta \) encapsulates important features, specifically \(\eta _{12}, \eta _{13}, \eta _{14}\), \(\eta _{15},\eta _{25},\eta _{35}, \eta _{45}\). The No Levi Obstruction (NLO) criterion for non-emptiness of ADLV, developed in [18] and further refined in [19, 63], suggests that the support of \(\eta (w)\) plays a crucial role in the non-emptiness pattern. This connects to the values of \(\eta _{ij}\), though the exact relationship is intricate and non-linear, representing an interesting problem in itself.

As seen from Table 7, another important feature is the Newton point \(\nu _b\) together with its best integral approximation \(\lambda (b)\). In general, we expect that \(X_w(b)\ne \emptyset \) holds “often” for small [b] and only occasionally for \(\nu _b\) close to \(\mu \), cf. [28].

The three-layer neural network achieves a high accuracy, but the gradient is hard to interpret mathematically. This is partly due to inherent properties of the machine learning model (cf. Sect. 3.2.1), as well as the underlying problem: since the target function f is discrete, the gradient of the smooth approximation \({{\hat{f}}}\) at individual points is hard to interpret. Nonetheless, this lack of explanation can inspire future mathematical research, directing our attention toward a more precise understanding of the underlying mathematical problem.

6.3 Experiments on the Dimension

Data: All \(w=xt^{\mu }y\in W_a\) and \([b]\in B({\textrm{SL}}_5)\), with the conditions \(\ell (w)<30\) and \(X_w(b)\ne \emptyset \). (Dataset size: 3,119,946)

Feature:

$$\begin{aligned} X = [x_{ij},\mu _1,\dots ,\mu _5, y^{-1}_{ij},\eta _{ij}, \ell (w), \nu _1,\dots ,\nu _5,\lambda _1,\dots , \lambda _5]\in {\mathbb {R}}^{46}. \end{aligned}$$

Output: \(Y = \dim X_w(b)\)

Result: Two models were utilized in our experiments, each executed 100 times. The first model, a single-layer neural ReLU classification network with ten neurons, achieved an average training accuracy of 77.02% and an average training error of 0.33, an average test accuracy of 77.06% and an average test error of 0.33. The average gradients can be found in Table 10. The second model, a three-layer neural ReLU classification network with 20 neurons per layer, reached an average training accuracy of 92.12% and an average training error of 0.16, an average test accuracy of 92.10% and an average test error of 0.16. The average gradients are detailed in Table 11.

Table 10 Average gradient of one-layer NN for dimension
Table 11 Average gradient of three-layer NN for dimension

Analysis. It can be observed that the average gradients for these neural networks closely align with the gradient of the linear model discussed in Sect. 5. In fact, the dimension equals the virtual dimension for 64.8% of the dataset, and for most of the remaining data points, the difference is just 1. The gradient of this linear function seems to dominate the more nuanced behavior of the neural network, resulting in the 92.1% accuracy. To obtain a more insightful average gradient, a comparison between the dimension and the virtual dimension should be considered.

6.4 Experiments on the Condition Virtual dim. = dim.

Data: All \(w=xt^{\mu }y\in W_a\) and \([b]\in B({\textrm{SL}}_5)\), with the conditions \(\ell (w)<30\) and \(X_w(b)\ne \emptyset \). (Dataset size: 3,119,946)

Feature:

$$\begin{aligned} X = [ x_{ij},\mu _1,\dots ,\mu _5, y^{-1}_{ij},\eta _{ij}=\delta (\eta (w)(\alpha _{ij})), \ell (w), \nu _1,\dots ,\nu _5,\lambda _1,\dots , \lambda _5]\in {\mathbb {R}}^{46}. \end{aligned}$$

Output: \(Y = {\left\{ \begin{array}{ll} 1&{} \text { if } VD \ne \dim X_w(b) \\ -1&{} \text { if } VD = \dim X_w(b) \\ \end{array}\right. }\)

Result: Three models were utilized in our experiments, each executed 100 times. The first model, an SVM, achieved an average accuracy of 83.13%, with the average coefficients shown in Table 12. The second model, a single-layer neural ReLU classification network with ten neurons, recorded an average training accuracy of 87.52% and an average test accuracy of 87.55%, with the average gradients shown in Table 13. The third model, a three-layer neural ReLU classification network with 20 neurons per layer, attained an average training accuracy of 96.66% and an average test accuracy of 96.67%, with the average gradients presented in Table 14.

Table 12 Average coefficient of SVM for VD = Dim
Table 13 Average gradient of one-layer NN for VD = Dim
Table 14 Average gradient of three-layer NN for VD = Dim

Analysis. We see that the \(\eta _{ij}\) are important features. We know that if \(\eta (w)\) is small, for example, if \(\eta (w)\) is a partial Coxeter element, then the dimension equals the virtual dimension [30]. If \(\eta (w)\) is large and close to the longest element, then the dimension is unlikely to equal the virtual dimension.

Moreover, it is known that if \(y=1\) or, under additional hypotheses, if \(x=w_0\), then the dimension must also equal the virtual dimension [46]. This explains the signs of the \(x_{ij}\) and \(y_{ij}\) in Table 12.

It is known in general, cf. (7.1), that the difference \(d_w(b)-\dim X_w(b)\) is maximal for large [b]. This explains why the Newton point is an important feature.

Table 15 gives the proportion of elements w, grouped by the length of \(\eta (w)\), where the virtual dimension is not equal to dimension for some b (i.e., the non-cordial elements in the sense of [46]).

Table 15 Number of elements for which dim = virt. dim. fails

6.5 Statistics of the Difference of Virtual dim. and dim.

The experiments and analysis above indicate that the virtual dimension is a good approximation of the dimension for non-empty \(X_w(b)\). A natural question is to further study the difference between the dimension and virtual dimension for non-empty \(X_w(b)\).

In this part, we provide a numerical analysis of the difference \(\Delta _w(b)=d_w(b)-\dim X_w(b)\) between the virtual dimension and the dimension for non-empty \(X_w(b)\) of Dataset 2, and hope that this analysis will guide us toward the pattern of the difference.

Table 16 \(\Delta _w(b)\)

As exhibited in Table 16, within the dataset the virtual dimension is an upper bound for the actual dimension; predicting the dimension by the virtual dimension achieves an accuracy of 65.82%. Furthermore, the predominant nonzero value of \(\Delta _w(b)\) is 1.

It seems that the percentage of pairs (w, b) with \(\Delta _w(b)\) larger than a given bound decreases rapidly. We further expect that \(\Delta _w(b)\) might be bounded from above by a constant depending only on n, i.e., the group \({\textrm{SL}}_n\). We will investigate this question by mathematical methods in Sect. 7.

6.6 Experiments on the Irreducible Components

In this section, we investigate the number of top-dimensional irreducible components of \(X_w(b)\) up to the \({\textbf{J}}_b\)-action. The analogous question has been solved for affine Deligne–Lusztig varieties in the affine Grassmannian [49, 69]. Here, the key ingredient is the dimension of the weight space of the highest-weight Verma module \(V_\mu (\lambda )\). This is a commonly studied object in the representation theory of Lie algebras, and we refer the reader to any of the corresponding textbooks for the exact definition. For now, we remark that this dimension is a positive integer that is computable in terms of \(\mu \in X_*(T)\) (which depends only on w) and \(\lambda (b)\in X_*(T)\) (which only depends on b). Our first experiment uses the data set from the previous experiment, restricted to those ADLV whose dimension agrees with virtual dimension.

(1) Data: All \(w=xt^{\mu }y\in W_a\) and \([b]\in B({\textrm{SL}}_5)\) with \(\ell (w)<30\) and \(\dim X_w(b)= d_w(b)\). (Dataset size: 2,020,909)

Feature:

$$\begin{aligned} X = [ x_{ij},\mu _1,\dots ,\mu _5, y^{-1}_{ij},\eta _{ij}, \ell (w), \nu _1,\dots ,\nu _5,\lambda _1,\dots , \lambda _5, \text {dim} V_{\mu }(\lambda ) ]\in {\mathbb {R}}^{47}. \end{aligned}$$

Output: \(Y = \sharp {\textbf{J}}_b \backslash \Sigma ^{\text {top}} X_w(b)\)

Table 17 Average gradient of one-layer NN for irreducible components
Table 18 Average gradient of three-layer NN for irreducible components

Result: Two models were utilized in our experiments, each executed 100 times. The first model, a single-layer neural ReLU classification network with 10 neurons, achieved an average training accuracy of 67.15% and an average training error of 0.48, an average test accuracy of 67.06% and an average test error of 0.48. The average gradients can be found in Table 17. The second model, a three-layer neural ReLU classification network with 20 neurons per layer, reached an average training accuracy of 75.96% and an average training error of 0.33, an average test accuracy of 75.92%, and an average test error of 0.33. The average gradients are shown in Table 18.

Analysis. We observe that the variables \(\nu _1,\nu _5,\eta _{15},\eta _{23},\eta _{24},\eta _{34},\mu _3\) play a sensitive role in the approximated function \({\hat{f}}\). Interestingly, the simpler model tends to overlook \(\eta _{23},\eta _{24},\eta _{34},\mu _3\), while these variables appear to be crucial for the more complex model. This suggests that the function f exhibits a complex relationship with respect to \(\eta _{23},\eta _{24},\eta _{34},\mu _3\). Moreover, the accuracy achieved in this experiment is lower than that of the previous one, which focused on the dimension problem. This suggests that the problem of determining irreducible components presents greater complexity.

We remark that the feature \(\dim V_\mu (\lambda )\) does not seem to be particularly influential. Repeating the experiment without this feature, we retain almost the same accuracy. To obtain further insight into the most well-behaved situations, we restrict our attention to a certain subset of elements, which are known to enjoy the most convenient properties.

(2) Data: All \(w=xt^{\mu }y\in W_a\) where \(y=1\) and \([b]\in B({\textrm{SL}}_5)\) with \(\ell (w)<30\) and \(X_w(b)\ne \emptyset \). In this case, it is known that \(\dim X_w(b)= d_w(b)\). (Dataset size: 43,986)

Feature:

$$\begin{aligned} X = [ x_{ij},\mu _1,\dots ,\mu _5, \ell (w), \nu _1,\dots ,\nu _5,\lambda _1,\dots , \lambda _5,\text {dim} V_{\mu }(\lambda ) ]\in {\mathbb {R}}^{27}. \end{aligned}$$

Output: \(Y = \sharp {\textbf{J}}_b \backslash \Sigma ^{\text {top}} X_w(b)\)

Result: Two models were utilized in our experiments, each executed 100 times. The first model, a single-layer neural ReLU classification network with ten neurons, achieved an average training accuracy of 67.82% and an average training error of 0.49, an average test accuracy of 67.67%, and an average test error of 0.49. The average gradients can be found in Table 19. The second model, a three-layer neural ReLU classification network with 20 neurons per layer, reached an average training accuracy of 81.80% and an average training error of 0.26, an average test accuracy of 81.60%, and an average test error of 0.27. The average gradients are shown in Table 20.

Table 19 Average gradient of one-layer NN for irreducible components
Table 20 Average gradient of three-layer NN for irreducible components

Analysis. The restriction to \(y=1\) implies that \(X_w(b)\) is equidimensional with \(\dim X_w(b) = d_w(b)\) whenever \(X_w(b)\ne \emptyset \) [45]. In other words, all irreducible components are top-dimensional.

We see that the accuracy of the single-layer neural network does not change much compared to the previous experiment. However, the gradients are vastly different. The biggest contribution in Table 19 comes from the dimension of the weight space \(\dim V_\mu (\lambda )\), which only made a tiny contribution in the previous experiment. In the case of three-layer neural networks, the restriction to \(y=1\) brings a substantial improvement in accuracy, and again, the contribution of \(\dim V_\mu (\lambda )\) becomes significantly larger.

If \(x =w_0\) is the longest element of the Weyl group, we know that the irreducible components of \(X_w(b)\) correspond one-to-one to the irreducible components of the affine Deligne–Lusztig variety \(X_\mu (b)\) inside the affine Grassmannian (following the proof of [25, Theorem 10.1]). For the latter kind of affine Deligne–Lusztig varieties, the number of \({\textbf{J}}_b\)-orbits of irreducible components has been predicted by Chen–Zhu, and their conjecture has been fully established by Zhou–Zhu [69] as well as Nie [49]. In this case, we know that \(Y = \dim V_\mu (\lambda )\).

More generally, the same conclusion holds whenever x is the longest element of a Levi subgroup, following the Hodge–Newton decomposition method of Görtz–He–Nie [19]. For general x, we may expect that the number of irreducible components is much smaller. Nonetheless, it should not be a surprise that \(\dim V_\mu (\lambda )\) is the input feature with the highest overall contribution, as measured by the average gradient. It is not quite clear why this was not the case in the previous experiment, since these two cases are related by the partial conjugation method, which is independent of \(\mu \). This could be an artifact of our limited data set. However, the situation remains overall mysterious, and we invite the interested readers to further explore this phenomenon through mathematical insight or ML-assisted research.

Overall, we may summarize that the problem of enumerating irreducible components allows for fairly accurate solutions using single-layer or three-layer neural networks. This gives hope that further mathematical progress on this problem should be possible. Moreover, the second subset does indeed seem to be better behaved for studying this problem.

7 Lower Bound on the Dimension

In Sect. 6, we developed machine learning models that not only enable us to recover previously known results, but also lead to new conjectures and research questions. In this section, we study the lower bound of the dimension of ADLV whenever they are non-empty. In other words, we study the upper bound of the difference \(d_w(b) - \dim X_w(b)\) whenever \(X_w(b)\ne \emptyset \). For \({\textbf{G}}= {\textrm{SL}}_n\), we will show that

$$\begin{aligned} \max _{X_w(b)\ne \emptyset } d_w(b) - \dim X_w(b) = {\left\{ \begin{array}{ll}k(k-1),&{}\quad n=2k,\\ k^2,&{}\quad n=2k+1.\end{array}\right. } \end{aligned}$$

Since the dimension of ADLV is of general interest, we establish such a lower bound in the most general case, i.e., we no longer specialize to \({\textbf{G}}= {\textrm{SL}}_n\).

7.1 General Setup

Let F be a non-archimedean local field with residue field \({\mathbb {F}} _q\) and let \(\breve{F}\) be the completion of the maximal unramified extension of F. We write \(\Gamma \) for \({{\,\textrm{Gal}\,}}({{\overline{F}}}/F)\), \(\Gamma _0\) for the inertia subgroup of \(\Gamma \) and \(\sigma \in \Gamma \) for the Frobenius. Let \({\textbf{G}}\) be a connected reductive group over F and \(\breve{G}={\textbf{G}}(\breve{F})\). Let \(\sigma \) be the Frobenius morphism on \(\breve{G}\). We fix a \(\sigma \)-stable Iwahori subgroup \(\breve{I}\) of \(\breve{G}\). Let \(Fl=\breve{G}/\breve{I}\) be the affine flag variety. Let \({{\tilde{W}}}\) be the Iwahori–Weyl group of \(\breve{G}\). Then, we have a natural identification \(\breve{I} \backslash \breve{G}/\breve{I} \cong {{\tilde{W}}}\) and the \(\sigma \)-action on \(\breve{G}\) induces a natural action on \({{\tilde{W}}}\), which we still denote by \(\sigma \). The extended affine Weyl group \({{\tilde{W}}}\) is the semidirect product of the finite Weyl group \(W_0\) and the \(\Gamma _0\)-coinvariants of the cocharacter lattice \(X_*(T)_{\Gamma _0}\).

For any \(b \in \breve{G}\) and \(w \in {{\tilde{W}}}\), we define the corresponding affine Deligne–Lusztig variety in the affine flag variety

$$\begin{aligned} X_w(b)=\{g \breve{I} \in \breve{G}/\breve{I}; g ^{-1}b \sigma (g) \in \breve{I} \dot{w} \breve{I}\} \subset Fl. \end{aligned}$$

It is known that the affine Deligne–Lusztig variety \(X_w(b)\) is a (probably empty) locally closed (perfect) scheme of locally finite type over the residue field of \(\breve{F}\). It is a general fact that one may reduce all questions regarding the geometry of affine Deligne–Lusztig varieties to the case where the group \({\textbf{G}}\) is quasi-split and of adjoint type [19, Sect. 2]. Hence, we assume from now on that \({\textbf{G}}\) satisfies these assumptions. In particular, this means that the finite Weyl group \(W_0\) is stable under the action of \(\sigma \).

We write \(\pi _1({\textbf{G}}) = X_*(T)/{\mathbb {Z}}\Phi ^\vee \) for the Borovoi fundamental group of \({\textbf{G}}\), where \(\Phi ^\vee \) is the set of coroots. The Kottwitz point of \([b]\in B({\textbf{G}})\) is denoted by \(\kappa (b) \in \pi _1({\textbf{G}})_{\Gamma }\); it characterizes the connected components of the affine flag variety up to the \(\sigma \)-action.

Write \(w = xt^\mu y\) with \(x,y\in W_0, \mu \in X_*(T)_{\Gamma _0}\) and \(t^\mu y \in {}^{{\mathbb {S}}} {{\tilde{W}}}\). Then, we put \(\eta _\sigma (w) = \sigma ^{-1}(y)x\). Let \(\nu _b\in X_*(T)_{\Gamma _0}\otimes {\mathbb {Q}}\) denote the dominant Newton point, \({\textrm{def}}(b)\in {\mathbb {Z}}_{\geqslant 0}\) the defect and \(2\rho \in X^*(T)^{\Gamma }\) the sum of the positive roots. Then, we define the virtual dimension of the pair (w, [b]) to be

$$\begin{aligned} d_w(b) = \frac{1}{2}\bigl (\ell (w) + \ell (\eta _\sigma (w))-\langle \nu _b,2\rho \rangle - {\textrm{def}}(b)\bigr ). \end{aligned}$$

Here, we write \(\ell \) for the length function of \({{\tilde{W}}}\) and \(W_0\), and write \(\ell _R\) for the reflection length on \(W_0\). The reflection length of an element \(u\in W_0\) is defined to be the smallest number n, such that u is a product of n reflections (not necessarily simple) in \(W_0\). It is denoted by

$$\begin{aligned} \ell _R(u) = \min \{n\, | \, \exists \alpha _1,\dotsc ,\alpha _n\in \Phi \text { such that } u = s_{\alpha _1}\cdots s_{\alpha _n}\}. \end{aligned}$$

Write \({\mathcal {O}} = \{w^{-1} w_0 \sigma (w)\, | \, w\in W_0\}\) for the \(\sigma \)-conjugacy class of the longest element \(w_0\) and \(\ell _R({\mathcal {O}}) = \min \{\ell _R(u)\, | \, u\in {\mathcal {O}}\}\). For a complete description of \(\ell _R({\mathcal {O}})\), we refer to [31].

Set \(B({\textbf{G}})_w = \{ [b]\in B({\textbf{G}}); X_w(b)\ne \emptyset \}\) and let \([b]\in B({\textbf{G}})_w\). We know that \(\dim X_w(b)\leqslant d_w(b)\), and the goal of this section is to give a bound of the difference \(d_w(b) - \dim X_w(b)\).

Theorem 2

The maximum of the difference between virtual dimension and dimension, for all pairs (w, [b]) such that the corresponding affine Deligne–Lusztig variety is non-empty, is precisely given by

$$\begin{aligned} \max _{\begin{array}{c} w\in {{\tilde{W}}}\\ {[}b]\in B({\textbf{G}})_w \end{array}} d_w(b) - \dim X_w(b) = \frac{\ell (w_0) - \ell _R({\mathcal {O}})}{2}. \end{aligned}$$

We summarize our proof strategy as follows. There is a unique maximal element in \(B({\textbf{G}})_w\), denoted by \([b_{w}]\). It is called the generic \(\sigma \)-conjugacy class of w, since it is the unique \(\sigma \)-conjugacy class, such that the intersection \([b_w]\cap \breve{I}w\breve{I}\) is dense in \(\breve{I}w\breve{I}\). The existence of \([b_w]\) and a useful combinatorial characterization are obtained by Viehmann in [61, Corollary 5.6].

By a deep result in arithmetic geometry, we will see that the difference of virtual dimension and dimension reaches its maximum, over \(B({\textbf{G}})_w\), at \([b] = [b_w]\).

The advantage of working with \([b_w]\) is that we have an explicit formula for the dimension of \(X_w(b_w)\). Combined with a description of \([b_w]\) via a certain combinatorial model, we then compute the difference \(d_w(b_w)-\dim X_w(b_w)\). Finally, we re-write that upper bound in terms of the length and reflection length functions of the finite Weyl group \(W_0\).

For \({\textbf{G}}= {\textrm{SL}}_n\), one evaluates \(\ell (w_0) = n(n-1)/2\) and \(\ell _R(w_0) = \lfloor n/2\rfloor \) to obtain the upper bound as stated in the beginning of this section.
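This evaluation is easy to verify in the split case, where \(\ell _R({\mathcal {O}}) = \ell _R(w_0)\) and the reflection length of a permutation in \(S_n\) equals n minus its number of cycles; a short sketch (hypothetical helper names):

```python
def reflection_length(p):
    """Reflection length of a permutation (one-line notation, values 1..n):
    n minus the number of cycles."""
    n, seen, cycles = len(p), [False] * len(p), 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = p[j] - 1
    return n - cycles


def max_difference_sln(n: int) -> int:
    """(ell(w_0) - ell_R(w_0)) / 2, the bound of Theorem 2 for split SL_n."""
    w0 = list(range(n, 0, -1))               # longest element: n, n-1, ..., 1
    return (n * (n - 1) // 2 - reflection_length(w0)) // 2


# max_difference_sln(5) == 4, matching k^2 for n = 2k + 1 = 5.
```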

7.2 Step 1: A Purity Result

We denote the usual dominance order on \(B({\textbf{G}})\) by \(\le \). For \([b],[b']\in B({\textbf{G}})\), this means that \([b]\le [b']\) if and only if the Kottwitz points \(\kappa (b),\kappa (b')\) agree as elements of \(\pi _1(G)_{\Gamma }\) and the difference of Newton points \(\nu _{b'}-\nu _b\) is a \({\mathbb {Q}}_{\ge 0}\)-linear combination of positive coroots. Geometrically, this means that the subset \([b]\subset \breve{G}\) lies inside the closure of \([b']\subset \breve{G}\).

One may now study increasing chains \([b] = [b_1]<[b_2]<\cdots <[b_{n+1}] = [b']\) in \(B({\textbf{G}})\). By the work of Chai [10, Theorem 7.4], we know that all maximal chains have the same length, given by

$$\begin{aligned} \text {length}([b],[b']) = \frac{1}{2}\bigl (\langle \nu _{b'}-\nu _b,2\rho \rangle +{\textrm{def}}(b) - {\textrm{def}}(b')\bigr ). \end{aligned}$$

One may similarly ask for geometric properties of \(\sigma \)-conjugacy classes not inside \(\breve{G}\), but inside the smaller subset \(\breve{I}w\breve{I}\). The intersection \([b]\cap \breve{I}w\breve{I}\) is infinite-dimensional, but it is admissible in the sense of [26], so there is a well-defined notion of codimension in \(\breve{I}w\breve{I}\) and closure inside \(\breve{I}w\breve{I}\). There is a notion of relative dimension of \([b]\cap \breve{I}w\breve{I}\), allowing us to express the dimension of \(X_w(b)\) in terms of \(\dim [b]\cap \breve{I}w\breve{I}\).

For \([b]\in B({\textbf{G}})_w\), the closure of the Newton stratum \([b]\cap \breve{I}w\breve{I}\) is contained in the union of all Newton strata \([b']\cap \breve{I}w\breve{I}\) for \([b]\ge [b']\in B({\textbf{G}})_w\), but one cannot in general expect to have an equality.

To compare the dimensions of different Newton strata inside \(\breve{I}w\breve{I}\), we use the purity theorem. This is a deep result in arithmetic geometry, developed by many experts, including de Jong, Viehmann, and Hamacher. We will not recall the statement nor the proof, due to the level of technicalities involved, and instead refer to the discussion in [62].

The statement we use here is due to Viehmann [62, Lemma 5.12]. It states that the codimension of \([b]\cap \breve{I}w \breve{I}\) inside \(\breve{I}w\breve{I}\) is at most \(\text {length}([b], [b_w])\). By [26, Theorem 2.23], we know that this codimension is equal to \(\dim X_w(b_w) - \dim X_w(b) + \langle \nu _{b_w}-\nu _b,2\rho \rangle \). Thus

$$\begin{aligned} \dim X_w(b_w) - \dim X_w(b)&\leqslant \text {length}([b],[b_w]) -\langle \nu _{b_w}-\nu _b,2\rho \rangle \nonumber \\&=\frac{1}{2}\big (\langle \nu _{b_w} - \nu _{b},2\rho \rangle +{\textrm{def}}(b) - {\textrm{def}}(b_w)\big )-\langle \nu _{b_w}-\nu _b,2\rho \rangle \nonumber \\&=\frac{1}{2}\big (\langle \nu _b - \nu _{b_w},2\rho \rangle +{\textrm{def}}(b) - {\textrm{def}}(b_w)\big )\nonumber \\&=d_w(b_w) - d_w(b). \end{aligned}$$
(7.1)

Hence, \(d_w(b) - \dim X_w(b) \le d_w(b_w) - \dim X_w(b_w)\). In other words, the function

$$\begin{aligned} B({\textbf{G}})_w\rightarrow {\mathbb {Z}}, \quad [b]\mapsto d_w(b)-\dim X_w(b) \end{aligned}$$

reaches its maximum at \([b] = [b_w]\). We now focus on this special case.

7.3 Step 2: The Quantum Bruhat Graph

Let \(w\in {{\tilde{W}}}\). Recall that \([b_w]\in B(G)_w\) denotes the generic \(\sigma \)-conjugacy class associated with w. In this step, we calculate the difference \(d_w(b_w) - \dim X_w(b_w)\) for arbitrary elements \(w\in {{\tilde{W}}}\).

It is known that \(\dim X_w(b_w) = \ell (w) - \langle \nu _{b_w},2\rho \rangle \), cf. [26, Theorem 2.23]. Thus, we compute

$$\begin{aligned} d_w(b_w) - \dim X_w(b_w) = \frac{1}{2}\bigl (-\ell (w)+\ell (\eta _\sigma (w)) +\langle \nu _{b_w},2\rho \rangle -{\textrm{def}}(b_w)\bigr ). \end{aligned}$$

By [56, Proposition 3.9], we know that

$$\begin{aligned} \langle \nu _b,2\rho \rangle - {\textrm{def}}(b)=\langle \lfloor b\rfloor ,2\rho \rangle , \end{aligned}$$

where \(\lfloor b\rfloor \in X_*(T)_{\Gamma }\) is the best integral approximation of [b] in the sense of [21]. More specifically, \(\lfloor b \rfloor \) is the unique element in \(X_*(T)_{\Gamma }\), such that

  • \(\kappa (b) = \kappa (\lfloor b \rfloor )\) in \(\pi _1({\textbf{G}})_{\Gamma }\) and

  • \(0\leqslant \langle \nu _b - \lfloor b\rfloor , \omega _o\rangle < 1\) for any \(o\in {\mathbb {S}}/\langle \sigma \rangle \). Here, \(\omega _o\in {\mathbb {Q}}\Phi \) is the unique weight whose pairing with a simple root \(\alpha \) is given by 1 if \(s_\alpha \in o\) and 0 otherwise.

It remains to compute this approximation \(\lfloor b_w\rfloor \) for arbitrary elements \(w\in {{\tilde{W}}}\). This is a result of Schremmer [56, Theorem 4.2], generalizing earlier results which compute this quantity in special cases.

To understand the generic \(\sigma \)-conjugacy class \([b_w]\), the tool of choice for Schremmer’s result and its predecessors is a finite combinatorial object associated with \({\textbf{G}}\), known as the quantum Bruhat graph. This graph was originally introduced by Brenti, Fomin, and Postnikov [5] as a consequence of certain solutions to Yang–Baxter equations. While originally intended to study quantum cohomology, especially the quantum Chevalley–Monk formula, it has since been found useful in a number of contexts, such as Kirillov–Reshetikhin crystals [39] and the Bruhat order of affine Weyl groups [38]. The calculation of the generic \(\sigma \)-conjugacy class of affine Weyl group elements is related to the Bruhat order via a result of Viehmann [61, Corollary 5.6]. The resulting connection between the quantum Bruhat graph and the generic \(\sigma \)-conjugacy class of sufficiently regular affine Weyl group elements was first discovered by Milićević [45].

The quantum Bruhat graph is defined as follows: By definition, \(\text {QBG}(\Phi )\) is a directed graph, whose set of vertices is given by the finite set \(W_0\). Its edges are of the form \(w\rightarrow ws_\alpha \) for \(w\in W_0\) and \(\alpha \in \Phi ^+\) whenever one of the following conditions is satisfied:

  • \(\ell (ws_\alpha ) = \ell (w)+1\) or

  • \(\ell (ws_\alpha ) = \ell (w)+1-\langle \alpha ^\vee ,2\rho \rangle \).

Edges satisfying the first condition are called Bruhat edges, whereas edges satisfying the second condition are called quantum edges (hence the graph’s name). It is common to draw the graph with the vertical position of the vertices corresponding to the length, with the longest element on top and the identity element at the bottom. Then, the Bruhat edges go upwards, whereas the quantum edges go downwards. This is the quantum Bruhat graph of type \(A_2\):

[Figure: the quantum Bruhat graph of type \(A_2\)]

It is known that the quantum Bruhat graph is (strongly) connected. Hence we may write \(d_{\text {QBG}}(u,v)\in {\mathbb {Z}}_{\geqslant 0}\) for the length of a shortest path from u to v in the quantum Bruhat graph, where \(u,v\in W_0\).
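For \(W_0 = S_n\) (type \(A_{n-1}\)), the graph and the distance \(d_{\text {QBG}}\) are small enough to compute directly: for a positive root \(\alpha _{ij} = e_i - e_j\) one has \(\langle \alpha _{ij}^\vee ,2\rho \rangle = 2(j-i)\), so \(w\rightarrow ws_{\alpha _{ij}}\) is an edge when the length increases by 1 or decreases by \(2(j-i)-1\). The following sketch (our naive implementation, meant only for small rank) builds the graph and computes \(d_{\text {QBG}}\) by breadth-first search.

```python
from itertools import permutations, combinations
from collections import deque


def inversions(p):
    return sum(1 for a in range(len(p)) for b in range(a + 1, len(p)) if p[a] > p[b])


def qbg_edges(n):
    """Directed edges of QBG(A_{n-1}) on the vertex set S_n (one-line tuples)."""
    edges = {w: [] for w in permutations(range(1, n + 1))}
    for w in edges:
        lw = inversions(w)
        for i, j in combinations(range(n), 2):
            ws = list(w)
            ws[i], ws[j] = ws[j], ws[i]          # right multiplication by s_{alpha_ij}
            ws = tuple(ws)
            # Bruhat edge: length up by 1; quantum edge: length down by 2(j-i)-1.
            if inversions(ws) in (lw + 1, lw + 1 - 2 * (j - i)):
                edges[w].append(ws)
    return edges


def d_qbg(u, v, edges):
    """Length of a shortest directed path from u to v (breadth-first search)."""
    dist, queue = {u: 0}, deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return dist[w]
        for x in edges[w]:
            if x not in dist:
                dist[x] = dist[w] + 1
                queue.append(x)
    return None


# Type A_2: w_0 = (3, 2, 1) is the reflection s_theta, so d_QBG(w_0, e) = 1.
print(d_qbg((3, 2, 1), (1, 2, 3), qbg_edges(3)))   # -> 1
```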

The description of the generic \(\sigma \)-conjugacy class \([b_w]\) in terms of w uses the quantum Bruhat graph and the decomposition \(w = xt^\mu y\) as above. The latter decomposition is not canonical if w is not very regular, so we might have to vary it slightly. This can be done using the notion of length positive elements, as introduced by Schremmer [56].

Let \(w = t^\lambda z\in {{\tilde{W}}}\). We say that \(v\in W_0\) is length positive with respect to w if all positive roots \(\alpha \in \Phi ^+\) satisfy

$$\begin{aligned} \langle z^{-1}\lambda ,v\alpha \rangle + \delta (zv\alpha )-\delta (v\alpha ) \ge 0. \end{aligned}$$

Denote the set of length positive elements by \({{\,\textrm{LP}\,}}(w)\subseteq W_0\). If we write \(w = xt^\mu y\) with \(t^\mu y\in {}^{{\mathbb {S}}} {{\tilde{W}}}\) as above, then \(y^{-1}\) is always length positive, i.e., \(y^{-1}\in {{\,\textrm{LP}\,}}(w)\).

With this setup, we can summarize the main result of [56] as follows:

Theorem 3

[56, Theorem 4.2 and Lemma 4.4] Let \(w =t^\lambda z\in {{\tilde{W}}}\). Let \([b_w]\) be the maximal element in \(B({\textbf{G}})_w\). Then

$$\begin{aligned}&\langle \lfloor b_w\rfloor , 2\rho \rangle = \ell (w) -\min _{v\in {{\,\textrm{LP}\,}}(w)}d_\mathrm{{QBG}}\bigl (v,\sigma (zv)\bigr ). \end{aligned}$$

In view of the above calculation, we obtain

$$\begin{aligned} d_w(b_w) - \dim X_w(b_w) = \frac{1}{2}\Bigl (\ell (y \sigma (x)) - \min _{v\in {{\,\textrm{LP}\,}}(w)}d_\mathrm{{QBG}}\bigl (v,\sigma (zv)\bigr )\Bigr ). \end{aligned}$$
(7.2)

A priori, since there are infinitely many elements in \({{\tilde{W}}}\), it is not clear whether or not the left-hand side of the above equation has an upper bound. However, since \(W_0\) is a finite group, the right-hand side (and thus also the left-hand side) of Eq. (7.2) has an upper bound.

7.4 Step 3: The Reflection Length as an Upper Bound

From Eqs. (7.1) and (7.2), we see that

$$\begin{aligned} \max _{[b]\in B({\textbf{G}})_w} d_w(b) - \dim X_w(b) = \frac{1}{2}\Bigl (\ell (y \sigma (x)) - \min _{v\in {{\,\textrm{LP}\,}}(w)}d_{\text {QBG}}\bigl (v,\sigma (zv)\bigr )\Bigr ). \end{aligned}$$

It remains to compute the maximum of this expression over all \(w\in {{\tilde{W}}}\). A major difficulty in explicitly evaluating the right-hand side of the above expression is the condition \(v\in {{\,\textrm{LP}\,}}(w)\). In this section, we relax this condition to \(v\in W_0\), thus obtaining the upper bound

$$\begin{aligned} \max _{[b]\in B({\textbf{G}})_w} d_w(b) - \dim X_w(b)\le \frac{1}{2}\max _{x,y,v\in W_0} \Bigl (\ell (y \sigma (x)) - d_{\text {QBG}}\bigl (v,\sigma (xyv)\bigr )\Bigr ). \end{aligned}$$

From the definition of the quantum Bruhat graph, we see

$$\begin{aligned} \min _{v\in W_0}d_{\text {QBG}}\bigl (v,\sigma (xyv)\bigr ) \ge \min _{v\in W_0} \ell _R\bigl (v^{-1} \sigma (xyv)\bigr ). \end{aligned}$$

We summarize

$$\begin{aligned} \max _{\begin{array}{c} w\in {{\tilde{W}}}\\ {[}b]\in B({\textbf{G}})_w \end{array}} d_w(b)-\dim X_w(b) \le&\max _{v,x,y\in W_0} \frac{1}{2}\Bigl (\ell (y\sigma (x)) - \ell _R\big (v^{-1}\sigma (xyv)\bigr )\Bigr ). \end{aligned}$$

Writing \(u=yv\) and \(\eta = y\sigma (x)\), we can re-write this as

$$\begin{aligned} \cdots&=\max _{u,x,y\in W_0} \frac{1}{2}\Bigl (\ell (y\sigma (x)) - \ell _R\bigl (u^{-1} y\sigma (x)\sigma (u)\bigr )\Bigr )\\&=\max _{u,\eta \in W_0}\frac{1}{2}\Bigl (\ell (\eta ) - \ell _R\bigl (u^{-1} \eta \sigma (u)\bigr )\Bigr ). \end{aligned}$$

If \(\eta \ne w_0\), we find a simple reflection s with \(\ell (\eta s) = \ell (\eta )+1\). Then, certainly \(\ell _R(u^{-1}\eta s\sigma (u))\le \ell _R(u^{-1} \eta \sigma (u))+1\). So when searching for the above maximum, we may replace \(\eta \) by \(\eta s\) until \(\eta = w_0\). Thus, we can simplify the above expression to

$$\begin{aligned} \cdots&= \max _{u\in W_0}\frac{1}{2}\Bigl (\ell (w_0) - \ell _R\bigl (u^{-1} w_0 \sigma (u)\bigr )\Bigr ). \end{aligned}$$

We proved that

$$\begin{aligned} \max _{\begin{array}{c} w\in {{\tilde{W}}}\\ {[}b]\in B({\textbf{G}})_w \end{array}} d_w(b)-\dim X_w(b) \le \frac{\ell (w_0) - \ell _R({\mathcal {O}})}{2}, \end{aligned}$$

obtaining the upper bound claimed in Theorem 2.

The reader will find a peculiar similarity to the paper [31] of He and Yu. They study a similar maximization problem in [31, Lemma 4.3, Theorem 5.1], proving that

$$\begin{aligned} \max _{x,y\in W_0} \Bigl (\ell \bigl (\sigma ^{-1}(y)x\bigr ) - d_{\text {QBG}}\bigl (x,y^{-1}\bigr )\Bigr ) = \ell (w_0) - \ell _R({\mathcal {O}}). \end{aligned}$$

7.5 Step 4: Explicit Construction of the Lower Bound

We saw in the previous step that

$$\begin{aligned} \max _{\begin{array}{c} w\in {{\tilde{W}}}\\ {[}b]\in B({\textbf{G}})_w \end{array}} d_w(b)-\dim X_w(b) \le \frac{\ell (w_0) - \ell _R({\mathcal {O}})}{2}. \end{aligned}$$

In this section, we prove that equality holds, by explicitly constructing an element \(w\in {{\tilde{W}}}\) and \(v\in {{\,\textrm{LP}\,}}(w)\), such that \(\eta _\sigma (w) = y\sigma (x) = w_0\) and \(d_{\text {QBG}}\bigl (v,\sigma (xyv)\bigr ) = \ell _R({\mathcal {O}})\). We do this construction in a case-by-case fashion, first considering the case where \({\textbf{G}}\) is quasi-simple over \(\breve{F}\), meaning that the root system is connected.

7.5.1 The Split and \(\sigma = {\textrm{Ad}}(w_0)\) Cases

Consider first the case where the action of \(\sigma \) on \(\Phi \) is either the identity map (i.e., \({\textbf{G}}\) is residually split) or equal to the action of \(-w_0\). In either case, we obtain \(\ell _R({\mathcal {O}}) = \ell _R(w_0)\). From [55, Sect. 5], we know that \(\ell _R(w_0) = d_{\text {QBG}}(w_0, 1)\). Then, \(w:= t^{2\rho ^\vee } w_0\) satisfies \(\eta _\sigma (w) = w_0\) and \({{\,\textrm{LP}\,}}(w) = \{w_0\}\), from which one obtains \(d_w(b_w) - \dim X_w(b_w) = \frac{\ell (w_0) - \ell _R({\mathcal {O}})}{2}\), completing the proof of the theorem in these cases.

Following the classification of root systems, one sees that \(\sigma \) is given by one of the above choices unless \({\textbf{G}}\) is non-split of type \(D_n\) with n even.

7.5.2 The Case \({}^2 D_{2k}\)

Suppose that \({\textbf{G}}\) is of type \(D_{2k}\) with \(k\geqslant 2\) and the image of \(\sigma \) in \({{\,\textrm{Aut}\,}}(\Phi )\) has order 2. Label the simple roots as \(\alpha _1,\dotsc ,\alpha _{2k}\) such that \(\alpha _i\) is connected to \(\alpha _{i+1}\) in the Dynkin diagram of \(D_{2k}\) for all \(i=1,\dotsc ,2k-2\). Then, \(\sigma \) interchanges the roots \(\alpha _{2k}\) and \(\alpha _{2k-1}\), while fixing all other roots. The element \(w_0\in W_0\) is central. One computes \(\ell _R({\mathcal {O}}) = 2k-2\) [31, Sect. 5.7].

Define \(x = s_{2k-1}, y = w_0 s_{2k}, v = w_0\in W_0\) and \(\mu \in X_*(T)_{\Gamma _0}\) as \(\mu = 2\rho _K^\vee \), the sum of all positive coroots of the sub-root system spanned by \(K = \{\alpha _1,\dotsc ,\alpha _{2k-1}\}\). Then, \(w = xt^{\mu } y \in {{\tilde{W}}}\) satisfies \(y = y^{-1} \in {{\,\textrm{LP}\,}}(w)\) and \(\ell (w, y \alpha _{2k}) = 0\), hence \(v\in {{\,\textrm{LP}\,}}(w)\). We get \(\eta _\sigma (w) = y\sigma (x) = w_0\).

It suffices to show that \(d_{\text {QBG}}(v, \sigma (xyv)) \leqslant \ell _R({\mathcal {O}}) = 2k-2\). For this, we compute

$$\begin{aligned} d_{\text {QBG}}(v, \sigma (xyv))&= d_{\text {QBG}}(w_0, s_{2k-1} s_{2k}) = d_{\text {QBG}}(s_{2k-1} w_0, s_{2k})\\ {}&=d_{\text {QBG}}(s_{2k} s_{2k-1} w_0, 1), \end{aligned}$$

where we applied [39, Lemma 7.7] twice. Since \(w_0\), \(s_{2k-1}\), and \(s_{2k}\) commute pairwise, we may write \(s_{2k} s_{2k-1} w_0 = w_0 s_{2k-1} s_{2k}\).

Denote the longest root of \(\Phi ^+\) by \(\theta \). Applying [39, Lemma 7.7] again, we conclude

$$\begin{aligned} d_{\text {QBG}}(w_0 s_{2k-1} s_{2k},1) = 1+d_{\text {QBG}}(s_\theta w_0 s_{2k-1} s_{2k},1). \end{aligned}$$

Let \(J = \{\alpha _1,\alpha _3,\dotsc ,\alpha _{2k}\}\), so that the longest element of \(W_J\) is equal to \(s_\theta w_0\). Then, \(s_\theta w_0 s_{2k-1} s_{2k} = s_1 w_0(J') s_{2k-1} s_{2k}\), where \(J' = J{{\setminus }}\{\alpha _1\}\). We conclude

$$\begin{aligned} d_{\text {QBG}}(w_0 s_{2k-1} s_{2k},1) = 2+d_{\text {QBG}}(w_0(J') s_{2k-1} s_{2k},1). \end{aligned}$$

If \(k=2\), one checks that \(w_0(J')s_{2k-1} s_{2k}=1\), so the desired identity follows. Otherwise, we have \(k\geqslant 3\), and we observe that \(J'\) defines a \(\sigma \)-stable sub-root system of type \(D_{2k-2}\). By induction, we may assume that \(d_{\text {QBG}}(w_0(J')s_{2k-1} s_{2k},1)=2(k-1)-2\) has already been established; then \(d_{\text {QBG}}(w_0 s_{2k-1} s_{2k},1)=2k-2\) follows immediately.
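Explicitly, the inductive step and the induction hypothesis combine to

$$\begin{aligned} d_{\text {QBG}}(w_0 s_{2k-1} s_{2k},1) = 2 + d_{\text {QBG}}(w_0(J') s_{2k-1} s_{2k},1) = 2 + \bigl (2(k-1)-2\bigr ) = 2k-2. \end{aligned}$$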

This completes the proof that, for the w constructed above, we have \(d_w(b_w) - \dim X_w(b_w) = \frac{\ell (w_0) - \ell _R({\mathcal {O}})}{2}\), establishing Theorem 2 for groups of type \(D_{2k}\) on which \(\sigma \) acts with order 2.

7.5.3 The Case \({}^3 D_4\)

If \({\textbf{G}}\) is of type \(D_4\) and \(\sigma \) has order 3, enumerate the simple roots \(\alpha _1,\dotsc ,\alpha _4\) such that \(\sigma (\alpha _1) = \alpha _3\), \(\sigma (\alpha _3) = \alpha _4\), and \(\sigma (\alpha _4) = \alpha _1\). Choose \(x = s_4\), \(y = w_0 \sigma (x) = w_0 s_1\), and \(v = w_0\) as above, and set \(\mu =2\rho _J^\vee \) with \(J = \{\alpha _2,\alpha _3,\alpha _4\}\). The element \(w = x t^\mu y\in {{\tilde{W}}}\) then satisfies \(v\in {{\,\textrm{LP}\,}}(w)\), and one obtains the same conclusion as in the previous case.
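Again one checks the key identity directly: \(\sigma (\alpha _4) = \alpha _1\) gives \(\sigma (s_4) = s_1\), so

$$\begin{aligned} \eta _\sigma (w) = y\sigma (x) = w_0 s_1\,\sigma (s_4) = w_0 s_1 s_1 = w_0. \end{aligned}$$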

7.5.4 The General Case

Following [31, Sect. 5.3], we can reduce to the case where \({\textbf{G}}\) is quasi-simple over F; in other words, we assume that \(\sigma \) acts transitively on the set of irreducible components of \(W_0\). Write \(W_0 = W_0' \times \cdots \times W_0'\) with \(\ell \) irreducible factors (here \(\ell \) denotes the number of factors, not the length function). There is a length-preserving group automorphism \(\sigma '\) of \(W_0'\) such that \(\sigma (w_1,\ldots , w_{\ell }) = (\sigma '(w_{\ell }),w_1,\ldots ,w_{\ell -1} )\). Let \(w_0'\) be the longest element of \(W_0'\) and let \({\mathcal {O}} '\) be the \(\sigma '\)-conjugacy class of \(w_0'\) in \(W_0'\). We distinguish the cases of \(\ell \) even and \(\ell \) odd, as in [31, Sect. 5.4].

Suppose first that \(\ell \) is even. Let \(u = (1,w_0',1,w_0',\ldots ,1,w_0')\); then \(w_0 = u\sigma (u)\), and hence \(\ell _R({\mathcal {O}}) = 0\). Let \(x = y=(w_0',1,w_0',1,\ldots ,w_0',1)\) and \(\mu = (2\rho ^{\vee },0,2\rho ^{\vee },0,\ldots ,2\rho ^{\vee },0)\), and consider \(w = x t^{\mu } y\). Then, \(y\sigma (x) = w_0\), \(v = y^{-1}u \in {{\,\textrm{LP}\,}}(w)\), and \(d_{\text {QBG}}(y^{-1}u, \sigma (xu)) = d_{\text {QBG}}(w_0,w_0) = 0\). Therefore, w satisfies the required condition.
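The claims \(w_0 = u\sigma (u)\) and \(\ell _R({\mathcal {O}}) = 0\) can be verified componentwise, using that \(\sigma '(w_0') = w_0'\) (as \(\sigma '\) preserves length and \(w_0'\) is the unique longest element) and, as before, that \(\ell _R({\mathcal {O}})\) is the minimal value of \(\ell _R\) on \({\mathcal {O}}\):

$$\begin{aligned} \sigma (u)&= \bigl (\sigma '(w_0'),1,w_0',1,\ldots ,w_0',1\bigr ) = (w_0',1,w_0',1,\ldots ,w_0',1),\\ u\sigma (u)&= (w_0',w_0',\ldots ,w_0') = w_0,\\ u^{-1}w_0\sigma (u)&= \sigma (u)\sigma (u) = 1, \end{aligned}$$

so the identity element lies in \({\mathcal {O}}\) and \(\ell _R({\mathcal {O}}) = 0\).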

Suppose now that \(\ell \) is odd. Then, by [31, Sect. 5.4.2], \(\ell _R({\mathcal {O}}) = \ell _R({\mathcal {O}} ')\). By the results proved above in Sects. 7.5.1–7.5.3, one can find \(w' = x't^{\mu '}w_0'\sigma (x'^{-1})\) and \(u'\in W_0'\) such that \(\sigma (x')w_0'u'\in {{\,\textrm{LP}\,}}(w')\) and \(d_{\text {QBG}}(\sigma (x')w_0'u',\sigma '(x'u')) = \ell _R({\mathcal {O}} ')\). Now, let

$$\begin{aligned} u&= (u',1,1,\ldots ,1),\\ v&= ( \sigma '(x')w_0' u',x'w_0',x' ,x'w_0',x' ,\ldots ,x'w_0',x'),\\ x&= (x',x'w_0',x',x'w_0',\ldots ,x',x'w_0',x'),\\ y&=(w_0'\sigma '(x'^{-1}),w_0'x'^{-1},x'^{-1},w_0'x'^{-1},x'^{-1},\ldots ,w_0'x'^{-1},x'^{-1}),\\ \mu&= (\mu ',\rho _{J_1}^{\vee },\rho _{J_2}^{\vee },\rho _{J_1}^{\vee },\rho _{J_2}^{\vee },\ldots ,\rho _{J_1}^{\vee },\rho _{J_2}^{\vee } ), \end{aligned}$$

where \(J_1 = \{i;x'w_0'(\alpha _i)<0\}\), \(J_2 = \{i;x'(\alpha _i)<0\}\), and \(\rho _{J_1}^{\vee }\) and \(\rho _{J_2}^{\vee }\) are the sums of the fundamental coweights corresponding to \(J_1\) and \(J_2\), respectively. Consider \(w = x t^{\mu } y\). Then, \(y\sigma (x) = w_0\), \(y^{-1}u \in {{\,\textrm{LP}\,}}(w)\), and

$$\begin{aligned} d_{\text {QBG}}(y^{-1}u, \sigma (xu)) = d_{\text {QBG}}(\sigma (x')w_0'u',\sigma '(x'u')) = \ell _R({\mathcal {O}} ')=\ell _R({\mathcal {O}}). \end{aligned}$$

Therefore, w satisfies the required condition, and Theorem 2 is fully proved.

7.6 Final Comments

This concludes the theoretical discussion of the difference \(d_w(b) - \dim X_w(b)\). It is noteworthy that, even though the dimension of affine Deligne–Lusztig varieties can be fully determined by the combinatorial algorithm presented in Sect. 2.3, we had to employ tools from quantum cohomology and algebraic geometry to prove Theorem 2. This situation is typical in research on affine Deligne–Lusztig varieties and illustrates why the field encompasses more than just analyzing a single algorithm on affine Weyl groups.

8 Conclusion

We used machine learning to study a central unsolved problem in pure mathematics, namely the geometry of affine Deligne–Lusztig varieties. In this section, we discuss the potential of this new research method and share some practical insights from our interdisciplinary collaboration.

Our project required an interdisciplinary research group, consisting of experts in machine learning and specialists in the mathematical problem at hand. This joint expertise allowed us to design machine learning models informed by subject-specific knowledge and to interpret their behavior from the perspective of a subject-matter expert. After exchanging explanations of the machine learning models used and of the mathematical problem to be studied, we established a common understanding of the material. We could even delve into highly technical questions about modeling specific discrete functions, such as the dimension of affine Deligne–Lusztig varieties, with neural networks.

While the research method employed in this project cannot by itself produce new mathematical proofs, it offers new insights, raises intriguing questions, and leads to conjectures. Starting with a naive view of the mathematical problem, we were able to rediscover some of the most crucial tools and invariants that the mathematical community had developed over a substantial amount of time and with considerable effort. This demonstrates that the pipeline has the potential to accelerate research.

Additionally, we identified new avenues of research that could be explored using established mathematical methods. By analyzing the linear model developed in Sect. 4, we formulated a new conjecture that has since been proven in Sect. 6.

We would like to emphasize some important requirements that contributed to our success, as a recommendation to other mathematical researchers (even from very different fields) who might benefit from this new research method:

  • Selection of Problems: Choose problems for which it is easy to generate data but difficult to find patterns. In our case, the mathematical problem we studied was computable for a large number of examples. While the general goal of “better understanding the geometry of affine Deligne–Lusztig varieties” is too vague, we were able to define specific numerical invariants, such as non-emptiness, dimension, and the number of irreducible components. With a substantial number of computed data points, ranging from thousands to millions, machine learning becomes a powerful tool for studying the patterns.

  • Machine Learning Model Selection: The choice of the machine learning model is crucial. While large neural networks may offer higher accuracy, they can reduce interpretability and make identifying patterns challenging. Complex function forms can also introduce noise in sensitivity analysis. Hence, it is important to strike a balance between model complexity and interpretability. Furthermore, the choice of loss functions and training algorithms may also play a vital role.

  • Leveraging Prior Knowledge: Prior knowledge plays a significant role in the success of this approach. It aids in selecting appropriate features and models and helps researchers explain and evaluate the results. Understanding the problem itself is crucial for distinguishing meaningful patterns from noise, especially when facing counterintuitive results.

  • Moderate Technical Requirements: The technical requirements are relatively low, and researchers do not need an overly complex machine learning setup. In our experiments, even with millions of data points, the small networks we used could be trained to a satisfactory level in just a few minutes on a single GPU, and training times on conventional laptops were generally less than an hour (a minimal training sketch in this spirit is given after this list). The generation of data, however, can be time-consuming: without an efficient parallel implementation, generating millions of data points on a single CPU can take several days.

  • Flexibility in the Target Function: One of the advantages of machine learning models is their ability to handle functions that are neither smooth nor continuous, and even functions that are discrete. In our case, the problem did not satisfy some common assumptions in machine learning, such as the independence of input features. While this introduced occasional challenges, we could usually find a way to alleviate them, extracting useful information about the machine learning model and the underlying mathematical problem in the process.

  • Effective Communication between Groups: Interdisciplinary communication, particularly between groups rooted in pure mathematics and in machine learning, is of paramount importance. In our research, we found that in-depth discussions on data generation, regularization, and fidelity terms significantly enhanced our comprehension of the numerical outcomes. While individual experiments can be executed swiftly, the subsequent analysis and interpretation of results often demand substantial time. The process inherently necessitates periodic revisions and discussions, leading to a cyclical pattern of repeated experimentation, and the analysis and interpretation phase tends to consume the lion’s share of time in this iterative process. A deeper appreciation of each other’s domains fosters clearer communication, leading to more efficient analysis and potentially quicker identification of necessary experimental modifications; this cross-disciplinary understanding therefore serves as a catalyst for accelerating the overall research. Conversely, the interdisciplinary research process itself naturally improves this mutual understanding over time.
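To illustrate the moderate technical requirements mentioned above, the following is a minimal sketch (not the code used in this project) of the kind of small feed-forward regression network we have in mind, written in PyTorch. The randomly generated feature matrix and targets are placeholders; in practice, they would be replaced by precomputed numerical invariants of the pairs (w, b) and the quantity to be predicted.

# Minimal sketch (not the authors' code): a small feed-forward regression
# network of the kind described above. The random features and targets are
# placeholders for precomputed invariants of the pairs (w, b).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

n_samples, n_features = 100_000, 16          # hypothetical data set size
X = torch.randn(n_samples, n_features)       # placeholder feature matrix
y = torch.randn(n_samples, 1)                # placeholder targets

device = "cuda" if torch.cuda.is_available() else "cpu"

# A deliberately small multilayer perceptron: two hidden layers keep the
# model quick to train and comparatively easy to inspect.
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(X, y), batch_size=1024, shuffle=True)

# A few epochs suffice for a network of this size; on a single GPU this
# takes on the order of minutes even for millions of samples.
for epoch in range(5):
    running = 0.0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item() * xb.size(0)
    print(f"epoch {epoch}: mean squared error {running / n_samples:.4f}")

For classification-type targets such as non-emptiness, one would replace the final layer and the mean-squared-error loss by a suitable classification head and loss.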