1 Introduction

In this paper, we present findings on employing sparse connectivity to reduce the memory consumption of the classification layer for problems with extremely large output spaces (XMC). Such problems arise in, e.g., tagging of text documents [8], next-word prediction [21], and different kinds of recommendation tasks [1, 5, 19, 24, 29]. To ensure computational tractability of these tasks, which can have up to several millions of labels, one typically builds a hierarchical label tree [14, 23, 30, 32] and explores only those branches that are likely to contain relevant labels for the current instance. Even though this is very effective at reducing computation (from linear to logarithmic in the number of labels), it does not address memory consumption, which remains linear in the number of labels times the number of hidden units.

As an illustration, consider the Amazon-3M [18] dataset. If we were to map the inputs to a hidden representation of 1024 units, the fully connected last layer for this dataset would need about 2.9 billion parameters, corresponding to 10.7 GiB. Given that modern deep learning optimizers such as Adam [16] need to keep track of the value, gradient, and first and second moment, this leads to an overall peak memory consumption of over 40 GiB, making it nigh impossible to train such models on commodity hardware.
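To make the arithmetic explicit, here is a small back-of-the-envelope sketch in Python (the label count of roughly 2.8 million is the approximate size of Amazon-3M; the remaining numbers follow from it):

```python
# Rough memory estimate for a dense extreme classification layer.
labels = 2_812_281            # approximate number of labels in Amazon-3M
hidden = 1024                 # hidden units feeding the last layer
bytes_per_float = 4           # fp32

params = labels * hidden                              # ~2.9e9 weights
weights_gib = params * bytes_per_float / 2**30        # ~10.7 GiB
adam_peak_gib = 4 * weights_gib                       # value + gradient + 1st/2nd moment

print(f"parameters:      {params:.2e}")
print(f"weight matrix:   {weights_gib:.1f} GiB")
print(f"peak with Adam:  {adam_peak_gib:.1f} GiB")    # > 40 GiB
```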

Therefore, we want to investigate possibilities for memory efficient sparse training of this huge last layer. There are two pre-existing approaches that serve as an indication that this idea could be successful: First, for DiSMEC, a linear model applied to tf-idf representations of input text, it is known that the resulting layer can be sparsified after training to contain less than 1% non-zeros [2]. In a linear model, the different classifiers for each label can be trained independently. As a result, only the full weight vector of the label currently being trained needs to be kept in memory, and it can be pruned as soon as the training for that label has finished. For non-linear models, the Mach [19] algorithm can be interpreted as a special case of training with static, random sparsity. It works by hashing the labels into different buckets, and performing training and predictions only on the level of buckets. If enough independent hashes are used, this method makes it possible to solve the original problem in the large output space. However, in practice, the results presented for Mach are not as good as for competing methods.

The contributions of this paper are as follows: We show that naïvely applying a dynamic sparse training algorithm to the last layer of an XMC problem results in strongly reduced predictive performance. Inspired by Mach, we then propose to alleviate this problem by inserting a penultimate layer that is larger than the hidden representation of the inputs, but still much smaller than the size of the label space. Such an increased layer size drastically improves the chances of dynamic sparse training finding a good subnetwork, and enables us to get results only slightly worse than training with a dense last layer. We demonstrate this on several large-scale datasets, for which we train a classification layer on a fixed set of pre-trained features. To ensure memory efficient and quick computations, we propose to restrict the sparsity structure to constant fan-in, such that each unit in the output layer receives exactly the same number of inputs. This has several important consequences: (i) it makes it impossible for the training to focus most non-zero weights on a few, prominent head labels, and instead ensures a more even distribution of the representational capacity, (ii) compared to coordinate-format this requires only half the memory to store the indices, and compared to compressed row sparse matrices the data layout is simpler, making it easier to implement the corresponding operations on a GPU, and (iii) it also means that changing the sparsity structure (redistribution of connections) can be implemented as a very cheap operation.

2 Setup and Background

We consider classification problems that map an input instance \({x}\in \mathcal {X}\) to a subset of a label set with \({m}\) labels, represented as a binary vector \(\boldsymbol{y}\in \lbrace 0, 1\rbrace ^{{m}}\). More precisely, we assume that \((x, \boldsymbol{y}) \sim \mathbb {P}\) are jointly distributed according to some probability measure. If almost surely \(\Vert \boldsymbol{y}\Vert _1 = 1\), it is a multiclass setup, otherwise a multilabel setup. We want to find a classifier \({f}:\mathcal {X} \longrightarrow \lbrace 0,1\rbrace ^{{m}}\) so that predicted labels \({\hat{\boldsymbol{y}}}={f}({x})\) and actual labels are close. Usually, \({f}\) can be decomposed into two operations: First, the inputs are embedded into a fixed-size vector space using a function \({\psi }:\mathcal {X} \longrightarrow \mathbb {R}^e\) (e.g. a linear projection, multilayer perceptron, or transformer-based text model), and then a decoding \({{\textbf {W}}}\in \mathbb {R}^{e \times {m}}\) is applied to extract scores for each label. The actual prediction is then generated by selecting the k highest scoring labels as positive, \({\hat{\boldsymbol{y}}}= \text {top}_k ({{\textbf {W}}}^{\textsf{T}} {\psi }({x}))\). Consequently, performance is typically measured in terms of precision-at-k, defined as the fraction of correct predictions

$$\begin{aligned} \text {P}@k(\boldsymbol{y}, {\hat{\boldsymbol{y}}}) = \frac{1}{k} \sum _{j=1}^{{m}} y_{j} \hat{y}_{j}\,. \end{aligned}$$
(1)
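As a sanity check of this definition, a minimal NumPy implementation (function name and argument shapes are our own choice) could read:

```python
import numpy as np

def precision_at_k(y_true, scores, k=5):
    """Fraction of the k highest-scoring labels that are actually relevant.

    y_true: (m,) binary ground-truth label vector.
    scores: (m,) real-valued label scores, e.g. W^T psi(x).
    """
    topk = np.argpartition(scores, -k)[-k:]   # indices of the k largest scores
    return float(y_true[topk].sum()) / k
```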

In order to find the optimal \({{\textbf {W}}}\) that maximizes \(\text {P}@k\), one often performs a One-vs-All (OvA) reduction [2, 3, 20]: A binary classification loss \(\ell \) is applied to each label separately. As this involves evaluating the scores \({{\textbf {W}}}^{\textsf{T}} {\psi }({x})\) for each label, many methods select a subset \(\mathcal {N} \subset [ {m} ]\) of hard negatives [7, 12, 14, 15, 26], to approximate the sum as

$$\begin{aligned} \begin{aligned} l(\boldsymbol{y}, {x}) = \sum _{j=1}^{{m}} \ell (y_{j}, \boldsymbol{w}_j^{\textsf{T}} {\psi }({x})) &= \sum _{j: y_{j} = 1} \ell (1, \boldsymbol{w}_j^{\textsf{T}} {\psi }({x})) + \sum _{j: y_{j} = 0} \ell (0, \boldsymbol{w}_j^{\textsf{T}} {\psi }({x})) \\ &\approx \sum _{j: y_{j} = 1} \ell (1, \boldsymbol{w}_j^{\textsf{T}} {\psi }({x}) ) + \sum _{j \in \mathcal {N}} \ell (0, \boldsymbol{w}_j^{\textsf{T}} {\psi }({x})) \,. \end{aligned} \end{aligned}$$
(2)
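As an illustration of Eq. (2), the following sketch evaluates the approximated loss for a single instance, using the squared-hinge binary loss that is employed later in the paper; the set `hard_negatives` is assumed to come from some external negative-mining procedure:

```python
import numpy as np

def squared_hinge(y_pm1, score):
    # y_pm1 in {-1, +1}; the loss is zero once the margin exceeds 1
    return np.maximum(0.0, 1.0 - y_pm1 * score) ** 2

def ova_loss_with_negatives(W, phi_x, positives, hard_negatives):
    """Approximate OvA loss: all positive labels plus a sampled negative set.

    W: (e, m) weight matrix, phi_x: (e,) embedded instance,
    positives / hard_negatives: index arrays into the label space.
    """
    loss = 0.0
    for j in positives:
        loss += squared_hinge(+1.0, W[:, j] @ phi_x)
    for j in hard_negatives:
        loss += squared_hinge(-1.0, W[:, j] @ phi_x)
    return loss
```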

This is very effective at reducing the required computations, and could also be beneficial for accuracy because it effectively changes the distribution of labels seen by the classifier [25], but it does not decrease the enormous amount of memory required to store the weight matrix \({{\textbf {W}}}\).

There are several established approaches to handle this problem: The most straightforward method is to place a bottleneck layer just before the final classification layer, so that the dimension of the embedding that \({{\textbf {W}}}\) operates on is comparatively low. For example, LightXML [14] projects the 3280-dimensional representation used for determining hard negatives down to only 300 units for the extreme-level classification. This approach is limited in its effectiveness, as overly small sizes severely degrade classification quality. A second strategy is to prune the matrix \({{\textbf {W}}}\) after training, turning it into a very sparse matrix. This can reduce the model size to only a tiny fraction of the dense equivalent without negatively affecting its predictive power, but it does not solve the problem of memory consumption during training itself. The only exception is linear models, where the weight vectors \(\boldsymbol{w}_j\) for different labels can be trained independently and sparsified immediately after training, so that the full matrix never has to be materialized [2, 3]. Additionally, it is possible to exploit the relation between the primal and dual formulations of linear problems to achieve sparse training for max-margin classifiers with appropriate loss functions [31]. Finally, Mach [19] has shown that it is possible to train an extreme classifier on the level of meta-labels, obviating the need for the large weight matrix \({{\textbf {W}}}\) altogether. However, this corresponds, implicitly, to a multiplication by a sparse, fixed, binary matrix, which limits the expressiveness of the model, and it requires keeping multiple copies of the embedding network \({\psi }\).

Thus, existing sparse training methods for XMC either use post-training sparsification, or a fixed sparsity structure. Here, we want to apply the sparse evolutionary training (Set) algorithm [22] to the classification layer, so that we have sparse training with dynamic sparsity structure. The Set algorithm follows a general prune-redistribute-regrowth cycle, which means that periodically, a subset of existing non-zero weights is selected to be removed (pruned), and new structural non-zeros will be inserted (redistributed). After that, the training of the sparse layer proceeds just as in any other gradient-based optimization, i.e., the structural non-zeros are updated according to their mini-batch gradient (regrown), and the structural zeros are left unchanged, until the next cycle.

This general algorithmic structure can be implemented in various ways, depending on how the pruned weights are selected and how it is determined where they should be re-distributed [11]. The Set algorithm uses very simple heuristics: The set of least important connections is determined by sorting according to the absolute value of their weights, and removing the fraction \(\alpha \) of connections with the lowest magnitude. The same number of new connections is inserted after pruning, by choosing uniformly at random from the structural zeros.
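To make the heuristic concrete, here is a minimal sketch of one prune/regrow cycle on a COO-style sparse weight matrix (names are ours; initializing regrown weights to zero is our simplification):

```python
import numpy as np

def set_rewire(rows, cols, values, shape, alpha=0.1, rng=np.random):
    """One Set-style prune/regrow cycle on a COO-style sparse matrix.

    rows, cols, values: parallel 1-D arrays describing the structural non-zeros.
    alpha: fraction of connections to replace per cycle.
    """
    n_replace = int(alpha * values.size)

    # Prune: drop the n_replace connections with smallest absolute weight.
    keep = np.argsort(np.abs(values))[n_replace:]
    rows, cols, values = rows[keep], cols[keep], values[keep]

    # Regrow: add the same number of connections at random empty positions.
    occupied = set(zip(rows.tolist(), cols.tolist()))
    new_rows, new_cols = [], []
    while len(new_rows) < n_replace:
        r, c = rng.randint(shape[0]), rng.randint(shape[1])
        if (r, c) not in occupied:
            occupied.add((r, c))
            new_rows.append(r)
            new_cols.append(c)

    rows = np.concatenate([rows, np.asarray(new_rows, dtype=rows.dtype)])
    cols = np.concatenate([cols, np.asarray(new_cols, dtype=cols.dtype)])
    values = np.concatenate([values, np.zeros(n_replace, dtype=values.dtype)])
    return rows, cols, values
```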

While there exist more elaborate schemes, they are generally more complex to implement and require additional memory. For example, [4] bases its pruning decision on weights switching their sign, which means that it needs to store the previous signs of all structural non-zeros. To determine useful locations for inserting the redistributed connections, [9] uses a momentum term, which requires the same amount of memory as the weights of the original dense layer, and is thus infeasible in our setting. This also excludes any strategy that requires, even if only intermittently, a full, dense gradient, such as [10].

A naïve application of Set to the last layer leads to unsatisfactory results, and an implementation using just the available tools in tensorflow turns out to be suboptimal in terms of speed and memory consumption. Thus, we present in the next section some modifications to the architecture and training algorithm, as well as insights into an efficient implementation, to alleviate these shortcomings.

3 Method

In principle, implementing a sparse layer in tensorflow is straightforward: Replace the dense-dense matrix multiplication with a sparse-dense operation that is supplied by the framework, and the weight matrix with a SparseTensor.
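A minimal sketch of this naive approach (toy shapes, our own variable names, and the Keras layer boilerplate omitted) might look as follows:

```python
import tensorflow as tf

m, e, b = 5, 4, 8                      # toy sizes: labels, embedding dim, batch

# Sparse weight matrix with one row per label: indices are (label, input) pairs
# and must be int64 and lexicographically sorted for tf.sparse operations.
indices = tf.constant([[0, 0], [1, 2], [3, 1]], dtype=tf.int64)
values = tf.Variable(tf.random.normal([3]))          # the trainable non-zeros

features = tf.random.normal([b, e])

with tf.GradientTape() as tape:
    W = tf.sparse.SparseTensor(indices, values, dense_shape=[m, e])
    # scores[b, m] = features @ W^T, computed as (W @ features^T)^T
    scores = tf.transpose(tf.sparse.sparse_dense_matmul(W, features, adjoint_b=True))
    loss = tf.reduce_sum(tf.nn.relu(1.0 - scores) ** 2)   # placeholder loss

grads = tape.gradient(loss, values)    # gradient only for the structural non-zeros
```

Already in this toy version, the mandatory int64 indices consume 16 bytes per structural non-zero, which is the first of the problems discussed next.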

There are four problems with this approach: First, it wastes memory due to tensorflow's requirement that all indices be given as 64-bit integers. Second, completely unstructured sparsity makes efficient implementations challenging, especially on GPUs. Third, the tensorflow operations cannot exploit the sparsity in the gradient signal that arises naturally when training with hinge-like losses. Finally, replacing the dense layer with a highly sparse layer results in underfitting. We address these problems below.

3.1 Efficient 32-Bit Indexing

In tensorflow, sparse tensors are represented in coordinate (Coo) format (Fig. 1a), which means that each structural nonzero in a sparse matrix is described by three numbers. Two 64-bit integers define the row and column of the structural nonzero, and a 32-bit floating point number its value. This means that a single sparse weight requires as much memory as five weights in the dense matrix.

Even for extreme-scale classification, however, 32-bit integers would be more than sufficient as column and row indices of \({{\textbf {W}}}\). A maximum representable value of around 4 billion is still an order of magnitude larger than even very large-scale proprietary problems [19] with hundreds of millions of labels, and three orders of magnitude larger than publicly available benchmark datasets.

Fig. 1. Schematic depiction of different sparse matrix formats. Note that in Coo format (a), the indices array in Algorithm 1 is of shape \(2 \times \text {nnz}\). In uniform format (c) it is \(\text {nnz per column} \times \text {labels}\), and hence only half as big, compared to the Coo format, for the same number of nonzeros.

3.2 Compressed Indexing and Equitable Work Distribution Through Constant Fan-In

Even with 32-bit indices, a sparse weight still consumes three times as much memory as a dense weight, when represented in coordinate format. This could be made much more efficient by switching to compressed sparse column (Csc) format, where only row indices are saved directly, and for each column only the offset of its first index is stored (Fig. 1b). While this drastically reduces the amount of memory needed to store the indices, it also increases the complexity of involved computations. For example, in Coo format, one can assign each GPU thread to the same amount of structural non-zeros to handle during the matrix multiplication, as getting the corresponding row and column indices is a simple array lookup. In contrast, in Csc format, it is still trivial to assign one column to each thread (i.e., each thread will compute one output), but that can lead to a significant difference in the amount of work each thread has to do, and thus lead to inefficient use of GPU resources. Furthermore, redistribution becomes more involved, as inserting a new structural nonzero in an early column means that all the weights and indices that come after have to be shifted.

This can be simplified if we stipulate that each column should have the exact same number of structural non-zeros, such that \(\forall j: \Vert \boldsymbol{w}_j\Vert _0 = s\). Then, a single index array is sufficient, and the starting offset of each column can be calculated simply by multiplying the number of non-zeros per column with the column index, like in regular multidimensional array indexing (Fig. 1c). Distributing a multiplication with a constant fan-in sparse matrix across many threads is also easy, as we can simply assign one column (i.e., \(\boldsymbol{w}_j\)) to each thread, knowing that they correspond to the same amount of work. Finally, connection redistribution is cheaper, because the number of non-zeros stays constant for each column, and thus changes in one column never require moving around the data of other columns. As we will show in Sect. 4.2, the additional constraint on the number of connections per output does not negatively influence the model's predictive performance in the overparametrized regime.

Broadly, the implementation works as follows: The sparse weights are represented by two matrices, \(\texttt {indices} \in {\mathbb {N}}^{s \times {m}}\) and \(\texttt {weights} \in \mathbb {R}^{s \times {m}}\). The input is given as a matrix \(\texttt {features} \in \mathbb {R}^{{b}\times {e}}\), where \({b}\) denotes the batch size, and the output is a matrix \(\texttt {output} \in \mathbb {R}^{{b}\times {m}}\). CUDA threads are generated on a two dimensional grid, with one thread for each output. Thus, threads will be indexed by pairs, each of them consisting of \(\texttt {instance} \in [ {b} ]\) and \(\texttt {label} \in [ {m} ]\). Every thread performs the calculations given, schematically, in Algorithm 1.

Algorithm 1. Per-thread forward-pass computation for the constant fan-in sparse layer.
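As a plain-Python reference for this per-thread logic (not the actual CUDA kernel; the loops over instance and label stand in for the two-dimensional thread grid):

```python
import numpy as np

def constant_fanin_forward(features, indices, weights):
    """Reference forward pass for the constant fan-in sparse layer.

    features: (b, e) dense input activations.
    indices:  (s, m) integers; indices[:, j] are the input units feeding label j.
    weights:  (s, m) corresponding non-zero weight values.
    Returns scores of shape (b, m).
    """
    b = features.shape[0]
    s, m = indices.shape
    output = np.zeros((b, m), dtype=features.dtype)
    # One CUDA thread per (instance, label) pair; each accumulates s products.
    for instance in range(b):
        for label in range(m):
            acc = 0.0
            for k in range(s):
                source = indices[k, label]
                acc += weights[k, label] * features[instance, source]
            output[instance, label] = acc
    return output
```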

3.3 Speeding up Backward Pass Through Implicit Negative Mining

Our experiments with a sparse last layer showed that the largest fraction of time was spent in the backward pass. This is not surprising, as the backward pass requires two sparse matrix multiplications: to calculate the gradient with respect to the inputs, and to calculate the gradient with respect to the weights.

Fortunately, certain margin-based losses can induce high amounts of sparsity in the gradient of XMC problems, which can be exploited to ensure considerable speed-up [27, 31]. In the given enormous label space, each instance will have only a tiny subset of labels which are relevant to it, and many for which the decision that they are not relevant is “easy”. Thus, if the loss function gives zero penalty for these easy classifications (e.g., if the margin is large enough in hinge-like losses), then the error term to be back-propagated will be highly sparse. For the loss function that is mainly used in this paper, the squared-hinge loss \(\ell (y, \hat{y}) = \max (0, 1 - y \hat{y})^2\), the gradient is \(\partial \ell / \partial \hat{y} = -2y \max (0, 1 - y \hat{y})\), and thus exactly zero whenever \(y \hat{y} \ge 1\).
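For concreteness, with labels encoded as \(\pm 1\), the gradient and its sparsity pattern can be written as a one-line function (a small sketch, not taken from the released code):

```python
import numpy as np

def squared_hinge_grad(y_pm1, scores):
    """d/d(score) of max(0, 1 - y*score)^2 with y in {-1, +1}.

    The entry is exactly zero whenever y*score >= 1, i.e. for every label
    that is already classified with a sufficient margin.
    """
    return -2.0 * y_pm1 * np.maximum(0.0, 1.0 - y_pm1 * scores)
```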

Therefore, in the backward kernel, it becomes beneficial to explicitly check whether the backpropagated signal \(\partial \ell / \partial \hat{y}\), denoted by \(\texttt {backward} \in \mathbb {R}^{{b}\times {m}}\) in the algorithm, is already zero, and if so to skip the corresponding operations. In particular, this means not only that the multiplication with zero can be skipped, but also makes it unnecessary to load the second operand and to store the result. As sparse matrix operations are memory-bound, this can be highly beneficial.

In fact, if we distribute the threads in the same way as in the forward pass for the calculation of the gradient with respect to the features (one thread assigned to each label and instance), then most threads can be skipped entirely. A schematic of the resulting implementation is given in Algorithm 2. Because multiple labels can contribute to the gradient of each input feature, in this case several threads need to update the same part of the gradient array. Therefore, we have to resort to using atomic addition operations here.

Algorithm 2. Per-thread computation of the gradient with respect to the input features, skipping threads whose error signal is zero.
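Again as a plain-Python reference of the per-thread logic (the in-place additions below correspond to the atomic additions in the CUDA kernel):

```python
import numpy as np

def constant_fanin_backward_features(backward, indices, weights, feature_dim):
    """Reference gradient w.r.t. the dense input features.

    backward: (b, m) error signal dL/d(scores); mostly zero for hinge-like losses.
    indices, weights: (s, m) constant fan-in sparse layer parameters.
    Returns grad_features of shape (b, feature_dim).
    """
    b, m = backward.shape
    s = indices.shape[0]
    grad_features = np.zeros((b, feature_dim), dtype=weights.dtype)
    for instance in range(b):
        for label in range(m):
            err = backward[instance, label]
            if err == 0.0:
                continue              # implicit negative mining: skip the whole thread
            for k in range(s):
                source = indices[k, label]
                # atomicAdd in the CUDA kernel; plain accumulation here
                grad_features[instance, source] += err * weights[k, label]
    return grad_features
```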

For calculating the gradient of the weight values, it is possible to arrange threads so that they can act independently, by using one thread for each gradient entry, i.e., for each \(\texttt {label} \in [ {m} ]\) and \(\texttt {weight}\_\texttt {idx} \in [ s ]\). In this case, one cannot skip entire threads, but a zero in the backward signal still allows skipping the unpredictable, indirect memory lookup of feature = features[instance, source], as shown in Algorithm 3.

Algorithm 3. Per-thread computation of the gradient with respect to the non-zero weight values, skipping the feature lookup when the error signal is zero.
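A corresponding Python reference of the per-thread computation (one thread per structural non-zero, summing over the batch):

```python
import numpy as np

def constant_fanin_backward_weights(backward, indices, features):
    """Reference gradient w.r.t. the structural non-zero weights.

    backward: (b, m) error signal dL/d(scores).
    indices:  (s, m) input unit feeding each non-zero.
    features: (b, e) dense input activations.
    Returns grad_weights of shape (s, m).
    """
    b, m = backward.shape
    s = indices.shape[0]
    grad_weights = np.zeros((s, m), dtype=features.dtype)
    for label in range(m):
        for k in range(s):              # one CUDA thread per (label, weight_idx) pair
            source = indices[k, label]
            acc = 0.0
            for instance in range(b):
                err = backward[instance, label]
                if err == 0.0:
                    continue            # skip the indirect features[instance, source] load
                acc += err * features[instance, source]
            grad_weights[k, label] = acc
    return grad_weights
```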

3.4 Mitigating Underfitting by Adding an Intermediate Layer

Finally, we noticed that—even without constant fan-in—replacing the dense layer with a sparse layer results in diminished classification accuracy, which we attribute to underfitting. Thus, we propose to improve the expressiveness of the model by adding an intermediate layer between the embedding layer and the final classification layer. Because the last layer is sparse, its memory consumption is independent of the size of the preceding layer. Consequently, as long as this new intermediate layer is at least an order of magnitude smaller than the number of labels, this does not impede our goal of reducing memory requirements.

4 Experiments

In this section, we provide experimental evidence showing that sparse last layers are a viable approach to extreme multilabel classification. We run experiments with several well-known benchmark datasets, measuring duration and peak GPU memory consumption, as well as \(\text {P}@k\). After presenting results that justify the architectural choices we made, we provide additional data illustrating the trade-offs between memory consumption and classification accuracy by varying the sparsity and size of the intermediate layer. Then we investigate the effect of implicit negative mining. The section concludes with a discussion of the results. Additional experiments are given in the supplementary material at https://github.com/xmc-aalto/ecml23-sparse.

4.1 Experimental Setup

In this paper we focus on the setting of learning from fixed, low-dimensional representations of the instances. This enables us to do many more experiments than if we had to fine-tune an expensive transformer-based encoder for each run.

We use two different sources for the embeddings: 512-dimensional fastText-based representations as used for Slice [12], and the final classification embeddings from a trained CascadeXML [15] model with 768 dimensions. We present results on two datasets [33]: Amazon-670k [17] and Wikipedia-500k [6].

To update the network’s weights, we use the Adam optimizer [16] with an initial learning rate of \(1 \times 10^{-3}\) that is decayed by 1/2 whenever validation \(\text {P@3}\) stops improving, until reaching \(1 \times 10^{-4}\). After that, training is stopped once \(\text {P@3}\) stops increasing. For sparse layers, we initialize the connections uniformly randomly, potentially subject to the constraint that each label gets the same amount of connections. Every 1000 training steps, each consisting of 32 samples in a minibatch, the 10% lowest-magnitude weights are randomly redistributed. In order to mitigate overfitting, we apply dropout to the input features, dropping 10% for Amazon-670k and Wikipedia-500k-Slice features, and 20% for Wikipedia-500k-Cascade.
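For reference, the training hyperparameters described above can be summarised as follows (a configuration sketch only; names and structure are ours, not taken from the released code):

```python
# Hyperparameters for training the sparse classification layer (Sect. 4.1).
train_config = {
    "optimizer": "adam",
    "initial_learning_rate": 1e-3,
    "lr_decay_factor": 0.5,          # halve when validation P@3 stops improving
    "final_learning_rate": 1e-4,     # stop decaying here; then early-stop on P@3
    "batch_size": 32,
    "rewire_every_steps": 1000,      # prune/regrow cycle of the sparse layer
    "prune_fraction": 0.10,          # lowest-magnitude weights to redistribute
    "input_dropout": {
        "amazon-670k": 0.10,
        "wikipedia-500k (slice)": 0.10,
        "wikipedia-500k (cascade)": 0.20,
    },
}
```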

The experiments are run on an Nvidia V100. Even though we want to demonstrate the feasibility of XMC learning on a commodity GPU, in order to be able to make meaningful comparisons we have to train on the same GPU for all settings, which means that the GPU needs to have enough memory to fit a dense last layer. To quantify the memory benefits of sparse training, we record the peak memory consumption as reported by tensorflow. Note, in particular, that all cases with our proposed architecture consume significantly less than 4 GiB of GPU memory, and thus will be feasible, albeit training more slowly, on cheap gaming GPUs.

4.2 Results with Varying Architecture

As a first step, we want to show that the architectural choices described in Sect. 3 are useful. To that end, we compare the training with a dense last layer to the following settings:

  • A single, unstructured sparse layer,

  • A single, constant fan-in sparse layer,

  • An intermediate, dense layer, followed by an unstructured sparse layer,

  • An intermediate, dense layer, followed by a constant fan-in sparse layer.

The number of structural non-zeros is chosen such that the Unstructured sparse layers have an average of 32 connections per label, and the Constant-Fan-In sparse layers have exactly 32 connections per label. As a baseline with comparable memory consumption, we also trained a Bottleneck architecture that maps the input representation to a low-dimensional space of only 64 dimensions before projecting into the label space.

Table 1. Comparison of different network architectures. Con denotes the (average) number of connections per label, Int the intermediate layer’s size, Mem the peak GPU memory consumption, Eps the number of training epochs, and Time the duration of a single epoch in seconds. Bold marks the best results in any sparse setting.

We repeated each experiment five times and report the average, except for the extremely slow settings with unstructured sparsity, which we ran only once.

The results of these experiments are presented in Table 1. Several facts are immediately obvious from the recorded data: First, the naive, tensorflow-based implementation of Unstructured sparsity is very slow, to the degree that the sparse matrix multiplication ends up being 2-\(3\times \) slower than dense multiplication on the large datasets. Second, with the intermediate layer, the classification performance of constant fan-in and unstructured sparsity is almost identical. Third, without an intermediate layer, there is a significant drop in \(\text {P@3}\), both in training and test performance, showing that naïve sparsification leads to severe underfitting. In cases where the unstructured sparse layer fails to perform well even on the training set, the additional constraint does lead to a further drop in performance. The Bottleneck baseline outperforms the direct sparse layer, but is significantly weaker than the combination of sparse and intermediate layer.

The measurements further show that for training based on Slice features, the sparse implementation manages to attain and slightly surpass the classification performance of the equivalent dense layer, whereas for Cascade features there still remains a noticeable gap between dense and sparse training. As a first possible explanation, one might argue that Cascade features have been specifically trained so that they work well with a linear extreme classification layer, whereas Slice are more general features. Therefore, it is not the sparse realizations that perform better, but instead the dense setting that performs disproportionately worse for Slice features, as it does not have the benefit of the additional intermediate layer that allows non-linear classification boundaries. This argument does not hold up, though, as both features result in comparable model performance on the training set—it is the generalization gap that is much increased with Slice features.

Looking at the memory consumption, we can see that sparsification of the last layer does lead to a noticeable reduction, but it only becomes really effective with our implementation of constant fan-in sparsity. In this case, the memory consumption drops to between one third and one tenth of the dense equivalent.

4.3 Results with Varying Network Size

In Table 2, we demonstrate the effect of varying the number of connections per label, and the size of the intermediate layer, for the uniformly sparse setup. Unsurprisingly, increasing the network size results in improved classification performance. For Slice features, the sparse network can be considerably better than the dense counterpart. For Cascade features, increasing the size of the sparse layer provides a way of shrinking the gap between sparse and dense performance, while still remaining much more memory efficient than the dense setup. In particular for Wikipedia-500k, the change in memory consumption is only by a few percent, while the improvement in \(\text {P}@k\) is substantial. Except for Amazon-670k with Cascade features, increasing the model size results in reducing the number of training epochs.

The data also shows a clear qualitative difference between Amazon-670k and Wikipedia-500k: For Amazon-670k, switching from dense to sparse does not lead to a noticeable decline in the ability of the classifier to fit the training set, whereas for Wikipedia-500k the drop is dramatic, especially in the case of Slice features. This suggests that for the smaller Amazon-670k (490 449 instances), even the sparse architectures are overparametrized enough to interpolate the training set, whereas for Wikipedia-500k (1 813 391 instances), this is no longer the case, especially for the smaller sparse models.

Table 2. Train and test \(\text {P}@k\) on Amazon-670k with varying sparsity and intermediate-layer size, relative to dense performance. Results of a single run.

4.4 Quantifying the Effect of Implicit Negative Mining

Next, we show that the implicit negative mining effect discussed above can have a significant impact on the speed of training. To that end, we use the small model configuration with constant fan-in sparsity, 32 structural non-zeros per output, and 16k intermediate units, and train it once using the squared hinge loss (Sqh) and once using the binary cross-entropy (Bce) loss. As the Bce loss only goes to zero asymptotically, there will not be many explicit zeros in the signal being back-propagated through the sparse layer, and thus all labels have to be processed.

As shown in Table 3, this has a strong effect on the training time per epoch: The implicit negative mining with Sqh reduces the duration by about one third. Additionally, the squared hinge loss results in slightly better \(\text {P}@k\), and fewer training epochs.

Table 3. Comparison of training with square hinge loss and binary cross-entropy.

4.5 Discussion

The results above show that sparsification of the extreme layer is possible without a strong decrease in classification performance, relative to a dense layer. However, it has to be noted that training the dense layer in the common experimental protocol employed here yields worse results than reported state-of-the-art for the same set of features. Thus, even in cases where the sparse architecture outperforms the dense layer, reported results from the literature are still better.

In Table 4, we present the results from Slice [12] and Cascade [15], compared against our largest setting with 64 nonzeros per label and 65k intermediate units. Compared to these methods, ours performs up to 4% worse, trading off a little classification accuracy for a multifold reduction in memory consumption. For example, Cascade runs for over a day on two Nvidia A100 GPUs.

Table 4. Comparison of sparse results with state-of-the-art.

5 Conclusion and Outlook

In this paper, we have shown that it is possible to replace an extreme-scale dense classification layer with a memory-efficient sequence of an intermediately-sized layer followed by a constant-fan-in sparsely connected layer, without a strong drop in classification performance, and in some cases even improved \(\text {P}@k\).

The experiments performed so far investigate sparse layers in the context of a simple training procedure: Learning with the full label space, from fixed, pre-trained features. To achieve feature parity with existing approaches, this needs to be extended to allow for end-to-end training, where the featurizer \({\psi }\) is learned jointly with the classifier. Secondly, even though the implicit negative mining effect makes the computation in the backward pass sub-linear in the overall number of labels, it still requires a full forward pass. In order to get to competitive training times, one thus also has to integrate explicit negative mining into the training pipeline. Finally, the datasets used in this work still do not exceed a million labels.

We performed some initial experiments using Amazon-3M [18], which indicate a decrease in memory consumption from 63 GiB to 12 GiB, at the cost of about a \(5\%\) decrease in precision. While this is still too much for cheap gaming GPUs, it is well within the capabilities of common workstation cards. A more thorough investigation of this dataset is planned for future work.

We believe that this paper provides a good foundation, from which these goals can be achieved: First, by having the sparse multiplication implemented as a regular tensorflow layer, it can be readily included in a more general model, and automatic differentiation will ensure correct gradient calculations. Second, because we are constraining the sparsity to have constant fan-in, selecting a subset of labels for which scores shall be calculated becomes a trivial matrix slicing operation, similar to the fully-connected case. In follow-up works, we aim to incorporate our approach into existing end-to-end deep extreme classification frameworks while benefiting from explicit negative mining. Furthermore, from a statistical perspective, it is possible that constant-fan-in sparsity also leads to a better coverage of tail-labels, and improvements in the corresponding metrics [13, 28], which should be investigated.