1 Introduction

Machine learning methods are now an integral part of materials research [1, 2] thanks to the increasing availability of curated datasets containing structural and electronic data derived from quantum-mechanical calculations. Also, the accessibility and user-friendliness of Machine Learning algorithms and software have extended their utility to a broader scientific community. Machine Learning has been mostly used for predictive purposes, where algorithms are trained on data concerning a specific material property \(P_i\) with a set of features that can be related to \(P_i\) (i.e., the feature space \(\{f_n\}\)). Information such as numerical patterns and tendencies of \(P_i\) in the feature space \(\{f_n\}\) is then inferred for new materials that are not in the initial dataset. An ultimate goal for Machine Learning, however, is to build models that can go beyond prediction, with scientific concepts being discovered via descriptive models. This is a formidable task, either because the feature space is too large, or because the physical mechanisms behind a specific phenomenon are unknown, barely understood or complex. Some cases of success have been reported [3,4,5,6,7,8], including one that looks towards the discovery of semiconductors [9].

In this article, we use interpretable Machine Learning methods to analyze a fundamental property of semiconductors that still lacks a detailed understanding: the reason why some semiconductors have a direct band gap while others have an indirect one. In semiconductors, the top of the valence band (representing occupied states) and the bottom of the conduction band (representing empty states) are separated by an energy gap, shown as the blue region in Fig. 1. The top of the valence band and the bottom of the conduction band can be located at different k-wavevectors (i.e., indirect band gap, as shown in Fig. 1a) or at the same k-wavevector (i.e., direct band gap, as shown in Fig. 1b). Determining the type of band gap is relevant for semiconductor applications. For instance, indirect band gap materials are usually not suitable for optoelectronic devices because the absorption or emission of a photon requires an exchange of momentum between the electron and the lattice, and such a two-step process is less likely to occur. The most widely used material in the microelectronics industry, silicon, has an indirect band gap and is thus not suitable for optoelectronic devices such as light-emitting diodes (LEDs). Transforming indirect band gap materials into direct band gap ones is still challenging. Common strategies for tuning indirect-direct band gaps include alloying [10, 11], applying strain [12, 13], and quantum confinement [14]. The directness of the band gap is not difficult to determine, either theoretically or experimentally, as discussed in fundamental physics textbooks [15]. However, no unified theory exists to explain why a given material has a direct or an indirect band gap. Semiconductors encompass a variety of materials and crystal structures, from simple diamond structures such as Si to perovskites containing at least three different types of atoms and many structural distortions that can change their properties [16].

To the best of our knowledge, only a couple of works have focused on explaining the band gap directness in semiconductors. Yuan and collaborators [17] focused on Zincblende semiconductors and showed that materials whose cations have occupied d-levels tend to have direct band gaps owing to the symmetry of the zincblende lattice. In the zincblende point group, the deep d-band interacts with the valence and conduction bands (VB/CB) at the \(\Gamma\), L and X high-symmetry points in reciprocal space depending on the wavefunction symmetry, which is dictated by the symmetry of the k-point and the symmetry representation of the band. The d-bands do not interact with the bottom of the conduction band at the \(\Gamma\) point (red dots in Fig. 1). This means that in the presence of occupied d-orbitals (Fig. 1b), band repulsion at the other k-points pushes the CB higher in energy with respect to the \(\Gamma\) point, which stays fixed regardless of the presence of d-orbitals because of the lack of interaction. This mechanism controls the direct/indirect gap for this specific crystal structure. In another paper, Choubisa and collaborators [9] proposed general rules to explain when a material has a direct or indirect band gap. The rules include the occupation of p orbitals, the position of the LUMO and the electronegativity of the constituent atoms. As we will show below, our rules are considerably different from these, providing new ways to design direct or indirect band gap materials.

Fig. 1

Schematic representation of a band structure showing indirect and direct bandgaps (respectively, left and right panels). The blue regions stand for the bandgap and the blue dot at \(k_{1}\) is the top of the valence band. The bottom of the conduction band changes from the \(k_2\)-wavevector (blue point) for indirect bandgap to the \(k_1\)-wavevector (red dots) for direct bandgap

Herein, we conduct a broad analysis using data science and interpretable Machine Learning methods such as Decision Trees (DT) [18] and Random Forests (RF) [19]. In particular, we perform a descriptive analysis using the VAX method [20], extracting Jumping Emerging Patterns (JEPs, descriptive logic rules) [21, 22] from Machine Learning models, and then look for causal relations or insights to explain why a material has a direct or indirect band gap. Detailed information on the VAX method is provided in the Methodology. Our findings indicate that the directness of the band gap depends on the symmetry of the different materials. No general pattern (rule) was found that applies to all structural families of materials. However, if we divide the materials into smaller groups of compounds with similar symmetries (e.g., Zincblende, Wurtzite, Rock Salt, and Perovskite structures), VAX generates specific patterns that explain their band gap directness. Specifically, we recovered the known result for Zincblende structures, i.e., the dependence on the existence of d orbitals. Furthermore, we found that the relative energies of the highest occupied states determine the direct–indirect band gap transitions in other structures. We believe that these results demonstrate that explainable machine learning is a promising route to advance the understanding of physics problems.

2 Materials and methods

2.1 Materials and features

The data used in the Machine Learning algorithms were derived from online materials databases containing detailed band structures for thousands of materials. These include synthesized inorganic materials taken from the Inorganic Crystal Structure Database (ICSD) [23] and thousands of hypothetical materials whose properties were obtained in calculations using an extrapolating approach where chemical elements are replaced by similar ones within the same crystal structure. We used the Materials Project [24] database, version 2022, which contains 76022 gapped materials (out of a total of 126335 entries) with their band structures, including information on whether their band gap is direct or indirect. All these calculations were performed with a GGA exchange-correlation functional. To make our database more suitable for the Machine Learning analysis, we conducted a data curation process in which we screened out some problematic materials. Below we list the main filters used in this process, together with the number of available materials after each filter:

  • We removed all materials with either a ferromagnetic or antiferromagnetic ground state configuration, leaving 41241 materials that have no spin polarization. This is done in order to simplify the analysis, and remove the complexity of magnetic systems;

  • Compounds that have more than 18 atoms in the unit cell were removed, leaving our dataset with a total of 10477 materials. Owing to the reduction of the size of the Brillouin Zone in very large unit cells, it becomes harder to differentiate between direct and indirect band gap materials.
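
The screening steps listed above can be pictured as a sequence of filters applied one after another. The following is only a minimal sketch under stated assumptions, not the curation script used for this work: the record fields (`band_gap`, `is_magnetic`, `nsites`) and the toy entries are hypothetical and do not reproduce the Materials Project schema.

```python
# Toy records; the field names and values are made up for this sketch.
TOY_ENTRIES = [
    {"id": "toy-1", "band_gap": 1.1, "is_magnetic": False, "nsites": 2},
    {"id": "toy-2", "band_gap": 0.0, "is_magnetic": False, "nsites": 2},   # no gap
    {"id": "toy-3", "band_gap": 2.3, "is_magnetic": True, "nsites": 4},    # magnetic ground state
    {"id": "toy-4", "band_gap": 0.7, "is_magnetic": False, "nsites": 40},  # unit cell too large
]

MAX_SITES = 18  # limit on atoms per unit cell quoted in the text


def curate(entries, max_sites=MAX_SITES):
    """Apply the screening steps in order; return survivors and the count after each step."""
    gapped = [e for e in entries if e["band_gap"] > 0.0]
    nonmagnetic = [e for e in gapped if not e["is_magnetic"]]
    small_cell = [e for e in nonmagnetic if e["nsites"] <= max_sites]
    counts = {
        "gapped": len(gapped),
        "nonmagnetic": len(nonmagnetic),
        "small_cell": len(small_cell),
    }
    return small_cell, counts


kept, counts = curate(TOY_ENTRIES)
print(counts)
```

Recording the count after each step mirrors the way the text reports the dataset size after every filter (76022, then 41241, then 10477 materials).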

To make our dataset more suitable for an interpretable analysis, as discussed in detail in the results section, we split our data into smaller datasets of materials with the same crystal structure. We chose to concentrate on the following structures: Zincblende (ZB) structure (61 materials), Wurtzite (WZ) structure (38 materials), and Rock salt (RS) structure (61 materials). The choice of these groups facilitates the analysis, since all these compounds are binaries (A is the cation and B is the anion). Table 1 lists the features used together with the labels employed in other figures for reference. We used the Mendeleev package to retrieve the atomic features [25]. We have also analyzed perovskites, with chemical formula ABX\(_3\) (ABC\(_3\) in our nomenclature). The features used for the ternary perovskites are essentially the same as those for binary materials (Table 1), except that we needed to add the corresponding features for the third element C. Also, when a feature is defined as a difference between two values (e.g., \(\Delta _X\) for the difference in electronegativity), we included the differences between atoms A and B, B and C, and A and C.
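
The extension of difference features from binary to ternary compounds can be sketched as follows. This is an illustration only: the function name, the feature-naming scheme and the sign convention (first site minus second site) are assumptions, and the numbers are placeholders rather than real electronegativities.

```python
from itertools import combinations


def pairwise_differences(prefix, site_values):
    """Build one difference feature per unordered pair of sites, e.g. A-B, A-C, B-C."""
    return {
        f"{prefix}_{a}{b}": site_values[a] - site_values[b]
        for a, b in combinations(sorted(site_values), 2)
    }


# Placeholder numbers, not real electronegativities.
binary = pairwise_differences("delta_X", {"A": 1.0, "B": 3.0})
ternary = pairwise_differences("delta_X", {"A": 1.0, "B": 1.5, "C": 3.5})
print(binary)
print(ternary)
```

With two sites the function yields a single feature, and with three sites it yields the three pairs mentioned in the text.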

To select the materials of a specific group in the Materials Project, we restricted our search to a specific space group and number of sites in the primitive cell. For ZB we chose only the entries with space group F-43m (216) and only two sites in the primitive cell; for rock salt we selected space group Fm-3m (225) with only 2 sites in the primitive cell and composition AB; for wurtzite we selected materials with space group P6\(_3\)mc (186) with only 4 sites and composition AB; for cubic perovskites we used the Pm-3m symmetry with 5 sites in the primitive cell and composition ABC\(_3\); finally, for non-cubic perovskites we manually selected the compounds with composition ABC\(_3\) and the characteristic topology of interlinked octahedra. The space groups were those assigned by the Materials Project database, which uses the spglib package with the default tolerance settings.
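
The selection criteria above amount to a lookup on the space-group number and the number of sites. A minimal sketch, assuming hypothetical entry fields `spacegroup_number` and `nsites`, is given below; the composition check (AB or ABC\(_3\)) and the manual selection of non-cubic perovskites are deliberately left out. The number 221 is the standard space-group number of Pm-3m.

```python
# (space-group number, sites in the primitive cell) for each family, as quoted in the text.
# The entry field names used below are hypothetical.
FAMILY_CRITERIA = {
    "zincblende": (216, 2),
    "rock_salt": (225, 2),
    "wurtzite": (186, 4),
    "cubic_perovskite": (221, 5),
}


def family_of(entry):
    """Return the structural family whose criteria the entry meets, or None."""
    key = (entry["spacegroup_number"], entry["nsites"])
    for family, criteria in FAMILY_CRITERIA.items():
        if key == criteria:
            return family
    return None


print(family_of({"spacegroup_number": 186, "nsites": 4}))
```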

Table 1 Set of features used in the zincblende, wurtzite and rock salt dataset; A: cation features; B: anion features. Atomic features were extracted from Ref. [25]

2.2 The VAX method

The multiVariate dAta eXplanation (VAX) method [20] provides multivariate data analysis leveraging the descriptive power of Machine Learning models [26]. The main idea is to extract, select, and aggregate the so-called Jumping Emerging Patterns (JEPs) [22, 27] from Random Forest (RF) models [19, 28]. JEPs are descriptive logic rules [21, 22] whose variable selectors describe inter- and intra-class relationships. Once the JEPs are extracted, selected, and aggregated so as to maximize confidence and support, they are displayed in a matrix-like visual metaphor in VAX. The latter is shown in Fig. 2 (exploded view) for the 61 materials of the Zincblende dataset, 38 of which have a direct gap. This example is discussed further in the results section using another figure without the annotations.

Fig. 2

The matrix-like visual metaphor employed by VAX using the Zincblende dataset. JEPs are displayed as rows (class color: indirect as blue and direct as orange). Variables are arranged as columns. Local histograms are shown as cells. Global histograms (one row per class) are placed on the top. Pattern support. Cumulative coverage assuming the matrix order (top to bottom). Variable importance. Pattern support, cumulative coverage, and variable importance are mapped to size and brightness (gray-scale). FET (Fisher Exact Test) significance value (p) colored as green (statistically significant, \(p \le 0.05\)) or purple (not significant, \(p > 0.05\))

2.2.1 Patterns and variables

Patterns are arranged as rows and variables appear as columns. In Fig. 2, patterns \(p_9\) and \(p_5\) are placed at the first and second rows, involving variables ‘A_NValence’ and ‘total_N’ at the first and second columns. Classes are mapped to categorical colors, with blue assigned to class “indirect gap” and orange assigned to the class “direct gap”.

2.2.2 Local histograms and selectors

The matrix cells present, for a particular pattern, local histograms showing how the supported data instances are distributed along the available variables. Moreover, a solid-line frame inside a matrix cell represents a variable selector, that is, a distinct range of values defined by lower and upper bounds. For example, in the histograms using 5 bins in Fig. 2, the rule \(p_9\) (first row) contains a selector of 7.5 to 29.0 for variable ‘A_NValence’ (first column). The values in this cell can range from 2.0 to 29.0 (min. and max. values for ‘A_NValence’ considering all 61 materials). The pattern \(p_9\) also contains a selector of 51.0 to 132.0 for variable ‘total_N’ (second column) from the possible range of 12.0 to 132.0 (min. and max. values). Thus, pattern \(p_9\) supports (explains) direct gap materials (orange) with values between 7.5 and 29.0 (selector) for variable ‘A_NValence’ and between 51.0 and 132.0 for variable ‘total_N’. In contrast, pattern \(p_5\) (second row) has a selector of 2.0 to 9.5 for variable ‘A_NValence’ (first column). Although \(p_5\) has no selector for variable ‘total_N’ (second column), the local histogram is still shown in the matrix cell. Hence, pattern \(p_5\) supports (explains) indirect gap materials (blue) with values between 2.0 and 9.5 for variable ‘A_NValence’. Comparing patterns \(p_9\) and \(p_5\), indirect materials tend to have lower values than direct materials for both variables ‘A_NValence’ and ‘total_N’.
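
A pattern with its selectors can be read as a conjunction of range tests. The sketch below encodes the two selectors quoted above and checks whether a pattern matches only instances of its own class; it is not the VAX implementation. The dictionary layout, the function names, the inclusive bounds and the toy instances are all assumptions made for illustration.

```python
# Selector ranges quoted in the text for patterns p9 and p5 of the Zincblende dataset.
P9 = {"label": "direct", "selectors": {"A_NValence": (7.5, 29.0), "total_N": (51.0, 132.0)}}
P5 = {"label": "indirect", "selectors": {"A_NValence": (2.0, 9.5)}}


def matches(pattern, instance):
    """True when every selector range of the pattern contains the instance's value."""
    return all(lo <= instance[var] <= hi for var, (lo, hi) in pattern["selectors"].items())


def is_jumping(pattern, instances):
    """True when the pattern matches at least one instance and only instances of its own class."""
    labels = {inst["label"] for inst in instances if matches(pattern, inst)}
    return labels == {pattern["label"]}


# Made-up instances, used only to exercise the functions.
toy = [
    {"A_NValence": 12, "total_N": 60, "label": "direct"},
    {"A_NValence": 3, "total_N": 20, "label": "indirect"},
]
print(matches(P9, toy[0]), matches(P5, toy[1]), is_jumping(P9, toy), is_jumping(P5, toy))
```

A variable without a selector, such as ‘total_N’ in \(p_5\), simply imposes no test, which is why \(p_5\) is encoded with a single range.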

2.2.3 Global Histograms, Support, and Cumulative Coverage

Global histograms for each class are shown on the top of the matrix and the patterns’ support (class percentage) is mapped onto a column on the matrix’s left side. The cumulative dataset coverage is also mapped onto a column on the matrix’s left side, representing the cumulative percentage of data instances in a dataset covered (explained) by the pattern in that row and those in the rows above it. In this context, pattern \(p_9\) in Fig. 2 has a support value of 0.66, which means that 25 of the 38 (\(66\%\)) direct gap materials are supported (explained) by \(p_9\). The local histograms for pattern \(p_9\) (matrix cells in the first row) show how the 25 supported materials are distributed along variables ‘A_NValence’ and ‘total_N’. The global histograms for the “direct gap” class (orange) display the distribution of each variable over the set of all 38 direct gap Zincblende materials. Comparing the local and global histogram bins for variable ‘A_NValence’ (first column), the \(23\%\) of the 38 direct gap materials in the first bin-global (the first bucket in the global histogram for the direct gap class) are completely absent from pattern \(p_9\), since its first bin-local (the first bucket in the local histogram in the first row) is empty (\(0\%\)). However, of the \(32\%\) in the second bin-global, \(24\%\) are enclosed by the second bin-local for pattern \(p_9\). The pattern \(p_5\) has a support value of 0.57, i.e., it explains 13 of 23 (\(57\%\)) indirect gap materials. For variable ‘A_NValence’ (first column), \(78\%\) of indirect materials are in the first bin-global, of which \(52.5\%\) are retained by the first bin-local for pattern \(p_5\). The cumulative coverage for patterns \(p_9\) and \(p_5\) is 0.62, i.e., 38 \((25 + 13)\) of 61 materials (\(62\%\)) are covered by the two patterns. Therefore, more than half of the materials are explained by only two patterns (\(p_9\) and \(p_5\)).
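
The support and cumulative coverage values quoted above follow from simple ratios, which can be reproduced as in this small sketch. The function names are hypothetical, and the running sum assumes that the patterns in successive rows cover disjoint sets of materials, as is the case for \(p_9\) and \(p_5\), which belong to different classes.

```python
def support(n_supported, n_in_class):
    """Fraction of a class explained by one pattern."""
    return n_supported / n_in_class


def cumulative_coverage(supported_per_row, n_total):
    """Running fraction of the dataset covered, row by row (assumes disjoint rows)."""
    covered, out = 0, []
    for n in supported_per_row:
        covered += n
        out.append(covered / n_total)
    return out


# Counts quoted in the text: p9 covers 25 of 38 direct, p5 covers 13 of 23 indirect, 61 in total.
print(round(support(25, 38), 2))   # 0.66
print(round(support(13, 23), 2))   # 0.57
print([round(c, 2) for c in cumulative_coverage([25, 13], 61)])  # [0.41, 0.62]
```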

2.2.4 Variable Importance and FET p-value

The variable importance is shown visually above the global histograms and textually next to the variables’ names at the bottom. The FET \(p\)-value (statistical significance test) for each pattern is displayed in a column on the right side of the matrix, where green means statistically significant (\(p\) values below 0.05) and purple otherwise. The variable ‘A_NValence’ in Fig. 2 is more important than ‘total_N’ (0.179 against 0.104). Moreover, both patterns \(p_9\) and \(p_5\) are statistically significant (green for FET \(p\)).
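
To make the significance test concrete, the sketch below computes a two-sided Fisher exact test for a 2x2 table using only the Python standard library. The way the table is built here (materials covered or not covered by the pattern versus direct or indirect class) is an assumption made for illustration; the text does not specify the exact contingency table used by VAX.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, col2 = a + b, a + c, b + d
    n = row1 + c + d

    def weight(k):
        # Unnormalised hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(col2, row1 - k)

    k_min, k_max = max(0, row1 - col2), min(row1, col1)
    observed = weight(a)
    # Sum the tables that are at most as probable as the observed one.
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / comb(n, row1)


# Assumed table for p9: covered = 25 direct, 0 indirect; not covered = 13 direct, 23 indirect.
p_value_p9 = fisher_exact_two_sided(25, 0, 13, 23)
print(p_value_p9 < 0.05)
```

Working with the integer weights and dividing only once at the end avoids comparing rounded floating-point probabilities.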

In addition to the visualization of JEPs, VAX provides similarity maps for data instances using Dimensionality Reduction (DR) layouts [29] and patterns perspectives (dataset extension [30]). Clusters (groups of close data instances) and outliers (isolated data instances) are often revealed in such maps and, more importantly, can be interpreted through the JEPs visualization. Figure 3 presents a map for the Zincblende dataset, where clusters and outliers can be identified. Each point represents a zincblende material and the color refers to pattern support. Points colored in pink identify the materials supported (explained) by pattern \(p_9\), whereas those supported by \(p_5\) are in purple. The colored materials in the map form two clusters, and the difference between these clusters is the difference between the patterns that originate them (\(p_9\) and \(p_5\)). So, as noticed from the JEPs visualization in Fig. 2, the distinction between the cluster originated by pattern \(p_9\) and the one formed by pattern \(p_5\) is that materials in the latter tend to have lower values for variables ‘A_NValence’ and ‘total_N’. Moreover, an outlier is highlighted in Fig. 3 (isolated point on the right), representing a material that does not belong to any of the clusters and is far away from the other materials as well.

A support matrix (binary heat map) can be shown relating patterns (rows) to the supported data instances (columns). For the Zincblende dataset, Fig. 4 presents the support matrix for patterns \(p_9\) and \(p_5\) of Fig. 2. Pattern \(p_9\) supports AgBr, InAs, CdSe, CdTe, GaSb, AgI, CdS, GaAs, CuI, InN, ZnSe, BiB, ZnTe, HgSe, TlN, SnC, TlP, HfC, CuBr, InP, SnTe, Sn, HgS, InSb, and HgTe; and pattern \(p_5\) supports AlAs, AlN, SiC, BSb, BP, YN, ZrC, BN, C, AlSb, AlP, BAs, and RuC.
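
A support matrix of this kind is straightforward to build from the lists of supported materials. The following sketch uses a handful of the materials named above; the function name and the data layout are assumptions, and the column order is arbitrary.

```python
def support_matrix(pattern_support, instances):
    """Binary matrix: one row per pattern, one column per instance, 1 where supported."""
    return {
        name: [1 if inst in supported else 0 for inst in instances]
        for name, supported in pattern_support.items()
    }


# A few of the materials listed in the text for each pattern.
pattern_support = {"p9": {"GaAs", "InP", "CdTe"}, "p5": {"AlAs", "SiC", "BN"}}
columns = ["GaAs", "AlAs", "InP", "SiC", "CdTe", "BN"]
matrix = support_matrix(pattern_support, columns)
for name, row in matrix.items():
    print(name, row)
```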

Fig. 3

Similarity Map for the Zincblende dataset within the JEPs perspectives. Materials whose band gap directness is explained by patterns \(p_9\) and \(p_5\) cluster together and are identified as pink points (pattern \(p_9\)) and purple points (pattern \(p_5\)). An outlier is also highlighted (at the rightmost position)

Fig. 4

Support Matrix for the Zincblende dataset. Materials explained by patterns \(p_9\) and \(p_5\) (class color: indirect gap materials as blue and direct gap materials as orange). Pattern \(p_9\) supports 25 direct materials (AgBr, InAs, CdSe, CdTe, GaSb, AgI, CdS, GaAs, CuI, InN, ZnSe, BiB, ZnTe, HgSe, TlN, SnC, TlP, HfC, CuBr, InP, SnTe, Sn, HgS, InSb, and HgTe), whereas pattern \(p_5\) supports 13 indirect (AlAs, AlN, SiC, BSb, BP, YN, ZrC, BN, C, AlSb, AlP, BAs, and RuC)

In summary, VAX employs JEPs and visualization of similarity maps to examine multivariate datasets aiming at knowledge discovery and phenomena analysis. By displaying JEPs using a matrix-like visual metaphor, meaningful visual explanations can be reached by filtering and ordering patterns (rows) and variables (columns). In turn, similarity maps convey an overview of the dataset (e.g., clusters and outliers) using JEPs context. It is worth mentioning that VAX was developed for descriptive analyses [20]. Unlike predictive models, which focus on creating generic models (splitting datasets into training and testing subsets for estimating performance [31]), descriptive tasks are not intended to be generic outside the scope of the data under analysis. Instead, they should offer explanations of the phenomena observed in a single dataset [32, 33]. Hence, VAX provides JEPs with “\(100\%\) accuracy” for the dataset under examination. This is needed in order for us to be able to obtain patterns that work for all data instances. This is why we are able to find patterns even for all outliers. These are some of the advantages of VAX when compared to other methodologies.

3 Results and discussion

This section presents the descriptive analyses of the materials datasets using the VAX method [20]. The latter is available as a code package (Footnote 1) for the Python programming language. The source code for the results presented here is accessible as Python Jupyter notebooks, one for each material family: All Materials (Footnote 2), Zincblende (Footnote 3), Rock Salt (Footnote 4), Wurtzite (Footnote 5), and Perovskites (Footnote 6).

3.1 All materials

The VAX method was applied to the whole dataset of 10477 (post-filtering) materials, 7432 of which have an indirect gap while 3045 have a direct gap. A total of 1495 patterns (JEPs) were obtained, and the patterns with the highest support, \(p_{1398}\) and \(p_{842}\), are shown in Fig. 5. Pattern \(p_{1398}\) explains \(\sim 3\%\) of direct gap materials (80 of 3045), whereas \(p_{842}\) supports \(\sim 2\%\) of indirect gap materials (148 of 7432). These patterns are statistically significant (green for FET p), but together they cover only \(2\%\) (228 of 10477) of the dataset (cumulative coverage). The remaining 1493 patterns (not presented in Fig. 5) each have a support of at most about \(2\%\). The low representativeness of these patterns stems from the complexity of this dataset, which contains a huge variety of crystal structures, and from the lack of good features encoding symmetry.
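
For reference, the supports and the cumulative coverage quoted in this paragraph follow directly from the counts given in the text:

\[
\frac{80}{3045} \approx 2.6\%, \qquad \frac{148}{7432} \approx 2.0\%, \qquad \frac{80 + 148}{10477} = \frac{228}{10477} \approx 2.2\%.
\]

The first value is the one reported as \(\sim 3\%\), and the class sizes are consistent with the dataset size, since \(7432 + 3045 = 10477\).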

Fig. 5

The two patterns (\(p_{1398}\) and \(p_{842}\)) with the highest support, filtered from the 1495 JEPs (extracted, selected, and aggregated) obtained for the dataset with all materials (10477 – 7432 indirect gap and 3045 direct gap). Patterns (rows) are ordered by support and the variables (columns) are organized by importance. Pattern \(p_{1398}\) holds the highest support among the 1495 patterns; yet it supports only \(\sim 3\%\) of direct gap materials (80 of 3045). Since pattern \(p_{842}\) has \(\sim 2\%\) of support, the 1493 remaining patterns (not shown) have supports equal to or less than \(2\%\)

By inspecting the materials supported by \(p_{1398}\) and \(p_{842}\) (in Figures S1 and S2 of the supplementary information), we noted that the Machine Learning model (descriptive RF post-processed by VAX) grouped together materials with similar spatial atomic arrangements. Pattern \(p_{1398}\) supports (explains) only two-dimensional heterostructures in the database. These comprise different types of stacking of 2D materials such as MoS\(_2\)/MoSe\(_2\) and WS\(_2\)/WSe\(_2\), all of which have direct band gaps. For pattern \(p_{842}\), which describes indirect band gaps, all the materials explained are oxides, mostly containing Bi. This motivated us to split the database into different families of materials with similar crystal structures. By doing so, we avoid the need to specify the crystal structure and symmetry features. In turn, we infer that crystal structure and symmetry should be important in determining whether a material has a direct band gap. In the following subsections, we analyze materials grouped according to their crystal structures.

3.2 Zincblende

The zincblende (ZB) dataset is a good starting point for the analysis, since we can compare our results with those from Yuan et al. [17] and evaluate their d-band model. The ZB dataset contains 61 materials, 38 of which have a direct gap. This dataset is larger than the one used in reference [17], thus permitting a broader analysis. The dataset also includes elemental materials that share the same atomic arrangement but are composed of only one type of atom (C, Si, Ge and Sn); strictly speaking, these materials have the diamond crystal structure. As shown in Table 1, our feature set includes those necessary to capture the information used in the work of Yuan et al. [17], including the presence of d electrons and bond lengths. It also includes features such as electronegativity, electron affinity, and the rows of the periodic table where the constituent atoms are located.

A total of 9 patterns (JEPs) were obtained for the Zincblende dataset. Figure 6 shows the two highest support patterns: \(p_9\) and \(p_5\) (first and second rows). Pattern \(p_9\) supports \(66\%\) of direct materials (25 of 38), whereas \(p_5\) supports \(57\%\) of the indirect ones (13 of 23). Together these patterns explain the majority of the compounds, that is, \(62\%\) of the zincblende materials (38 of 61). The most important variable was the number of valence electrons of the cation, represented by ‘A_NValence’ (first column). The 25 direct gap materials from pattern \(p_9\) have values between 7.5 and 29.0 (selector) for variable ‘A_NValence’. On the other hand, the 13 indirect gap materials from \(p_5\) have values between 2.0 and 9.5. It is worth mentioning that the min. and max. values of pattern selectors (ranges learned by the Machine Learning model) may not be attainable values of the variable, e.g. 7.5 for the integer-valued variable ‘A_NValence’. In this case, these values must be interpreted by rounding up. Linking patterns \(p_9\) and \(p_5\) with the limit-of-6 rule found by Yuan et al. [17], if variable ‘A_NValence’ is larger than 6, there is a high probability that the material has a direct band gap. If the number is smaller than 6, indirect band gaps are more frequent. This limit of A_NValence = 6 is correlated with the existence of d electrons in the cations: within the pseudopotential approach of our database, A cations can only have more than six valence electrons if they have filled d orbitals. This conclusion is the same as the one reached by Yuan et al. [17] for zincblende compounds. Therefore, we successfully reproduced the main result obtained in that paper. A similar analysis holds for the second most important variable (‘total_N’).
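
One way to read fractional selector bounds for an integer-valued variable is to keep only the integers that lie inside the learned range. The sketch below does this for the two ‘A_NValence’ selectors quoted above; the function name is hypothetical, and the choice of rounding the lower bound up and the upper bound down is one possible reading of the rounding mentioned in the text.

```python
import math


def integer_selector(lo, hi):
    """Smallest and largest integers inside the closed interval [lo, hi]."""
    return math.ceil(lo), math.floor(hi)


direct_range = integer_selector(7.5, 29.0)    # selector of p9 for 'A_NValence'
indirect_range = integer_selector(2.0, 9.5)   # selector of p5 for 'A_NValence'
print(direct_range, indirect_range)

# Integers allowed by both selectors.
overlap = sorted(
    set(range(direct_range[0], direct_range[1] + 1))
    & set(range(indirect_range[0], indirect_range[1] + 1))
)
print(overlap)
```

Under this reading the two ‘A_NValence’ ranges share the values 8 and 9, so this variable alone does not separate the two patterns; \(p_9\) additionally carries a selector on ‘total_N’.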

Fig. 6

The two patterns (\(p_9\) and \(p_5\) – first and second rows) with the highest support, filtered from the 9 JEPs obtained (extracted, selected, and aggregated) from the Zincblende dataset (61 materials – 23 indirect and 38 direct). Patterns (rows) are ordered by support and the variables (columns) are ordered by importance. The variable ‘A_NValence’ (first column) is the most important, where low values are related to indirect materials (blue) and high (also medium) values to direct ones (orange). Together, patterns \(p_9\) and \(p_5\) explain \(62\%\) (38 of 61 – more than half) of the Zincblende materials

Figure S3 in the Supplementary Information presents all 9 patterns for the ZB case, while Figure S4 shows the similarity map and Figure S5 shows the support matrix. Most of the 9 patterns returned by VAX are used to explain materials that do not fit the ‘d-electron’ rule. Some of the patterns explain only a few materials, which means that a different physical mechanism is needed to explain the directness of their band gaps. These include Ge, GaP, CaSe and others. Moreover, pattern \(p_5\) explains materials such as C, Si, SiC, BN, BP, AlN, AlP, YN, and ZrC as having indirect band gaps because they do not have d bands.

3.3 Rock salts

The Rock Salt (RS) dataset has the same number of materials as the one for ZB, i.e. 61 in total (32 with direct and 29 with indirect band gaps). We used the same variables employed for ZB, and found 9 patterns (JEPs) for the RS dataset. The highest support patterns \(p_8\) and \(p_9\) are shown in Fig. 7. It is important to note that pattern \(p_9\) for the RS dataset is different from pattern \(p_9\) of the ZB dataset since they come from different Machine Learning models. Pattern \(p_8\) supports \(88\%\) of direct gap materials (28 of 32), whereas \(p_9\) supports \(79\%\) of those with indirect band gaps (23 of 29). Together these patterns explain \(84\%\) of the rock salt materials (51 of 61). Even with the same number of patterns (9 JEPs), the coverage of the two highest support patterns (\(p_8\) and \(p_9\)) for RS is larger than for ZB (with patterns \(p_9\) and \(p_5\)), i.e. \(84\%\) against \(62\%\). Thus, determining the directness of the band gap is easier for RS than for ZB. The remaining results for the RS dataset are shown in the Supplementary Information, with all the patterns found, the similarity map and the support matrix shown in Figure S6, Figure S7 and Figure S8, respectively.

By inspecting the second most important variable (second column) for patterns \(p_8\) and \(p_9\) in Fig. 7, which is ‘A_top_energy’, one notices distinct histogram distributions for direct and indirect band gaps. ‘A_top_energy’ indicates the energy of the highest occupied atomic level of atom A. \(66\%\) of direct gap materials (21 of 32) are found in the last bin-local for variable ‘A_top_energy’ in pattern \(p_8\), in contrast to \(7\%\) of indirect gap materials (2 of 29) in pattern \(p_9\). Hence, if ‘A_top_energy’ is high, there is a strong tendency for the band gap to be direct. This trend is used here to explain the indirect gap materials supported by pattern \(p_9\), which are explicitly listed in Figure S9. By analyzing this list we realize that these indirect band gap materials have Be, Mg, Ca, Sr or Ba as cations (smaller ‘A_top_energy’). These are compounds where the cation has a completely filled valence s orbital, which can be related to a smaller ‘A_top_energy’.

Fig. 7

The two patterns (\(p_8\) and \(p_9\) – first and second rows) with the highest support, filtered from the 9 JEPs (extracted, selected, and aggregated) obtained for the Rock Salt dataset (61 materials – 29 with indirect and 32 with direct band gaps). Patterns (rows) are ordered by support and the variables (columns) are ordered by importance. The variables ‘delta_X’ and ‘A_top_energy’ (first and second columns) are the most important. For ‘A_top_energy’ (second column), direct materials (orange) concentrate (\(66\%\) – 21 of 32) in the last histogram bin-local for pattern \(p_8\), whereas pattern \(p_9\) reveals only a few (\(7\%\) – 2 of 29) indirect materials (blue) for such a bin. Together, patterns \(p_8\) and \(p_9\) explain \(84\%\) (51 of 61 – strong generic behaviour) of the rock salt materials

To further evaluate the explanations provided by patterns \(p_8\) and \(p_9\), we selected two compounds with similar chemical structures but different band gap behaviors: CaSe has an indirect band gap, and KBr has a direct band gap. All these atoms belong to the same period of the periodic table. Figure 8 shows the band structures of these compounds. The conduction band minimum (CBM) of KBr (direct band gap) is composed of K s orbitals and the band gap is at the \(\Gamma\) point. For CaSe, the CBM is composed of Se p orbitals and is located at the X point. The Ca s orbitals are higher in energy, making the band gap indirect. The relative position of the K (Ca) s orbitals, related to the variable ‘A_top_energy’, is key in defining the directness of the band gap.

Fig. 8

Band structures of KBr (above) and CaSe (below). The colors in the graphs indicate the atomic contribution for each eigenvalue. These band structures were calculated using DFT with a GGA exchange correlation functional

3.4 Wurtzites

The Wurtzite dataset has 38 materials, 29 of which have direct band gaps. Wurtzite (WZ) compounds have a crystal structure similar to that of Zincblende materials. Each cation is bound to four anions but, unlike in zincblende, one of these bonds has a different length from the others. Also, the stacking along the (111) direction is different in these two crystal structures, with ZB showing an ABCABC... stacking (three different layers) and WZ having an ABAB... configuration (two different layers) [34]. A total of 4 patterns (JEPs) were obtained for the WZ dataset; they are shown in descending order of support in Fig. 9. Patterns \(p_3\) and \(p_2\) (first and second rows) yield the highest support, covering \(82\%\) of the wurtzite materials (31 of 38). This coverage with the two highest support patterns is almost the same as that in the RS dataset. Pattern \(p_3\) supports \(89\%\) of indirect gap materials (8 of 9), whereas \(p_2\) supports \(79\%\) of the direct gap ones (23 of 29). A comparison of the two patterns indicates that the direct gap materials (orange) tend to have higher values for variables ‘total_N’ and ‘AB_bond_len’ (second and first columns). Note that there is a correlation between these variables: compounds with a larger number of electrons tend to have larger bond lengths. This analysis also leads to an interpretation similar to that for the ZB dataset: a larger number of electrons appears when the constituent atoms have d orbitals. The similarity of the ZB and WZ patterns comes from their similar crystal structures.

Fig. 9

All 4 patterns obtained (extracted, selected, and aggregated) from the Wurtzite dataset (38 materials: 9 with indirect and 29 with direct gaps). Patterns (rows) are ordered by support and variables (columns) are ordered by importance. Patterns \(p_3\) and \(p_2\) (first and second rows) have the highest support. For variables ‘total_N’ and ‘AB_bond_len’ (second and first columns, the most important ones), direct gap materials (orange) tend to have higher values than indirect gap compounds (blue). Together, patterns \(p_3\) and \(p_2\) explain \(82\%\) of the wurtzite materials (31 of 38, a strong generic behaviour).

The similarity map in Figure S10 in the Supplementary Information highlights two clusters formed by patterns \(p_3\) and \(p_2\) (pink and purple). A third cluster is formed with pattern \(p_4\) (maroon) and there is an outlier supported by pattern \(p_1\) (emerald). Figure S11 presents the materials supported by patterns \(p_3\), \(p_2\), \(p_4\), and \(p_1\).

The difference between the clusters formed by patterns \(p_2\) and \(p_4\) (purple and maroon in Figure S10) lies in variables ‘total_N’ (second column) and ‘delta_X’ (fourth column). The latter variable indicates the difference in electronegativity between atoms A and B. The materials from pattern \(p_4\) have large values of ‘delta_X’, indicating that they are more ionic. Higher ionicity also leads to larger band gaps, which weakens the coupling between the d bands and the conduction band, decreasing the importance of the d bands and making the band gap direct.

The indirect material AlP, an outlier in Figure S10, is supported by pattern \(p_1\) (fourth row in Fig. 9). It differs from the 8 indirect materials supported by pattern \(p_3\) (C, Si, BP, BN, GeC, BaO, BAs, SiC; first row in Fig. 9) in variable ‘A_top_energy’ (seventh column). AlP has a higher value for ‘A_top_energy’ (variable selector between \(-0.11\) and \(-0.1\)) than the 8 materials explained by pattern \(p_3\) (variable selector between \(-0.22\) and \(-0.11\)).
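The distinction between AlP and the materials of pattern \(p_3\) comes down to which interval (selector) the value of ‘A_top_energy’ falls into. The sketch below illustrates such a check. The `Selector` class and the half-open interval convention are assumptions made for illustration, since the source does not state how interval boundaries are treated, and the sample values are placeholders chosen to fall inside each interval, not actual descriptor values of any compound.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Selector:
    """Hypothetical interval condition on one variable: low <= value < high."""
    variable: str
    low: float
    high: float

    def matches(self, material: Dict[str, float]) -> bool:
        return self.low <= material[self.variable] < self.high


# Selector bounds for 'A_top_energy' as quoted in the text.
p1_selector = Selector("A_top_energy", -0.11, -0.10)
p3_selector = Selector("A_top_energy", -0.22, -0.11)

# Placeholder values (NOT real data), picked to land inside each interval.
material_in_p1_range = {"A_top_energy": -0.105}
material_in_p3_range = {"A_top_energy": -0.15}

print(p1_selector.matches(material_in_p1_range))  # inside the p_1 interval
print(p3_selector.matches(material_in_p1_range))  # above the p_3 interval
print(p3_selector.matches(material_in_p3_range))  # inside the p_3 interval
```

A full pattern would combine several such selectors, one per variable, and a material is supported by the pattern only if all of them match.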

3.5 Perovskites

Perovskites are much more complex than the materials discussed so far. They comprise three types of atoms (\(\hbox {ABX}_{3}\)), with a variety of chemical formulas and many crystal structures. Although the general chemical formula has an \(\hbox {ABX}_{3}\) stoichiometry, there are also double perovskites such as \(\hbox {A}_{2}\hbox {BB}' \hbox {X}_{6}\) [35] and various magnetic configurations [36]. As for the crystal structure, perovskites can be cubic, tetragonal, orthorhombic and even polymorphous [37]. All these properties can influence the band gaps [38]. The Perovskite dataset has 132 materials, of which 104 have indirect band gaps. For perovskites, we used a different set of variables from that used for the binary compounds, because the variables in Table 1 for binary materials cannot fully encode the chemical space of perovskites. Additionally, the dataset contains not only cubic perovskites but also perovskites with rotated octahedra. Thus, we need variables that also discriminate this structural degree of freedom.

A total of 13 patterns (JEPs) were obtained for the Perovskite dataset, and Fig. 10 shows the 6 patterns with the highest support. These 6 patterns are the only ones with statistical significance. Pattern \(p_4\) (first row) supports \(58\%\) of the indirect gap materials (60 of 104), whereas \(p_9\) supports \(50\%\) of those with a direct gap (14 of 28). Together, these two patterns explain about half (\(56\%\)) of the compounds, i.e. 74 of 132 materials. The main difference between patterns \(p_4\) and \(p_9\) is in variables ‘total_N’ and ‘A_p’ (first and last columns), for which direct gap materials (orange) tend to have higher values than indirect gap materials (blue). These variables indicate, respectively, the total number of electrons and the relative energy of the highest occupied p orbital of atom A.

The 14 direct gap materials from pattern \(p_9\) have values between 121 and 190 (selector) for ‘total_N’ and from \(-0.54\) to 0.0 (selector) for ‘A_p’. The same tendency for ‘A_p’ is also found in the remaining direct gap patterns \(p_{13}\), \(p_2\), \(p_{12}\), and \(p_1\) (third to sixth rows). As observed for Zincblende and Wurtzites, the trend for ‘total_N’ indicates that compounds with a larger number of electrons have direct band gaps, which is correlated with the existence of filled d bands.
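In its usual definition, a jumping emerging pattern (JEP) is a conjunction of selectors that matches materials of one class and none of the other. The sketch below checks this property for the two selectors quoted for \(p_9\). The function names, the closed-interval convention, and the three toy rows are assumptions for illustration only (the rows are not materials from the dataset), and the actual pattern may include selectors on other variables.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]
Pattern = Dict[str, Interval]  # variable name -> (low, high), both inclusive


def matches(pattern: Pattern, row: Dict[str, float]) -> bool:
    """True if the row satisfies every selector of the pattern."""
    return all(low <= row[var] <= high for var, (low, high) in pattern.items())


def is_jep(pattern: Pattern, rows: List[Dict[str, float]],
           labels: List[str], target: str) -> bool:
    """True if the pattern matches at least one row and only rows of `target`."""
    hits = [label for row, label in zip(rows, labels) if matches(pattern, row)]
    return bool(hits) and all(label == target for label in hits)


# Selectors quoted in the text for pattern p_9.
p9: Pattern = {"total_N": (121, 190), "A_p": (-0.54, 0.0)}

# Toy rows (hypothetical values, NOT taken from the Perovskite dataset).
rows = [
    {"total_N": 150, "A_p": -0.30},  # satisfies both selectors
    {"total_N": 90, "A_p": -0.30},   # 'total_N' below the selector range
    {"total_N": 150, "A_p": -0.80},  # 'A_p' below the selector range
]
labels = ["direct", "indirect", "indirect"]

p9_is_jep = is_jep(p9, rows, labels, target="direct")
print(p9_is_jep)
```

On this toy table the pattern matches only the first row, which is labeled direct, so the check succeeds. If any matched row carried the other label, the pattern would no longer qualify as a JEP.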

Fig. 10

The 6 patterns with the highest support among the 13 JEPs obtained (extracted, selected, and aggregated) for the Perovskite dataset (132 materials: 104 indirect and 28 direct). Patterns (rows) are ordered by support and variables (columns) are ordered by importance. The variable ‘total_N’ (first column) is the most important; for patterns \(p_4\) and \(p_9\) (first and second rows), medium and low values are related to indirect gap materials (blue), while high values refer to direct gap materials (orange). For variable ‘A_p’ (last column), all patterns for direct gap materials (\(p_9\), \(p_{13}\), \(p_2\), \(p_{12}\), and \(p_1\); second to sixth rows) contain higher values than pattern \(p_4\) (indirect gap materials). These six patterns explain \(67\%\) (89 of 132) of the perovskite materials.

Figure S12 in the Supplementary Information shows all 13 patterns for the Perovskite dataset, while Figures S13 and S14 display the similarity map and the support matrix, respectively.

4 Conclusions

Machine learning methods in materials science have mostly been used to predict specific properties of materials. Examples include models that predict band gap values [39], the stability of 2D compounds [40], or the magnetic order of compounds [5]. Recent works [6, 41], however, indicate that Machine Learning can be used for more complex tasks, including the interpretation of physical phenomena. In this paper, using VAX, we found that explaining band gap directness requires separating the semiconductors into smaller groups with the same crystal structure. Within each group, we could determine the rules that explain why a semiconductor structural family has a direct or indirect gap. This has been done for zincblende materials, confirming the conclusions of an earlier work [42], and for other crystal structures. Overall, we observed that two main features are essential to determine whether a material has a direct or indirect band gap. The first feature is the symmetry of the crystal and, consequently, the symmetry of each high-symmetry k-point in reciprocal space. This conclusion was inferred from the fact that we were unable to obtain a good Machine Learning model using all materials together; creating subsets with the same crystal structure was necessary to get good results. The second feature is the existence of specific orbitals (e.g., d orbitals) and their respective energies and occupations. The interaction among these orbitals may push the valence band maximum/conduction band minimum (VBM/CBM) at specific k-points towards higher/lower energies, leading to changes in the band gap directness.

In addition to providing explanations for the directness of semiconductor band gaps, the use of VAX in this paper showcases new ways to employ machine learning to study materials in an interpretable way. The VAX method can be used in conjunction with any rule-based machine learning algorithm.