DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

Crowd counting provides an important foundation for public security and urban management. Due to the existence of small targets and large density variations in crowd images, crowd counting is a challenging task. Mainstream methods usually apply convolutional neural networks (CNNs) to regress a density map, which requires annotations of individual persons and counts. Weakly-supervised methods can avoid detailed labeling and only require counts as annotations of images, but existing methods fail to achieve satisfactory performance because a global perspective field and multi-level information are usually ignored. We propose a weakly-supervised method, DTCC, which effectively combines multi-level dilated convolution and transformer methods to realize end-to-end crowd counting. Its main components include a recursive swin transformer and a multi-level dilated convolution regression head. The recursive swin transformer combines a pyramid visual transformer with a fine-tuned recursive pyramid structure to capture deep multi-level crowd features, including global features. The multi-level dilated convolution regression head includes multi-level dilated convolution and a linear regression head for the feature extraction module. This module can capture both low- and high-level features simultaneously to enhance the receptive field. In addition, two regression head fusion mechanisms realize dynamic and mean fusion counting. Experiments on four well-known benchmark crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF_QNRF, and JHU-Crowd++) show that DTCC achieves results superior to other weakly-supervised methods and comparable to fully-supervised methods.


Introduction
Crowd counting is an important topic in the field of crowd analysis: the aim is to estimate the number of people in an image. With increasing population and urbanization, there are more and more crowd-containing locations: e.g., subway platforms, bus stations, airports, tourist attractions, and shopping malls. Crowd congestion can occur during peak hours, with a serious negative impact on public safety. Accurate crowd counting can help to avoid crowd congestion, and plays an essential role in public security, abnormal situation warning, and pedestrian control.
Significant progress has been made in crowd counting via computer vision through years of relevant research. As Fig. 1 shows, existing crowd counting methods can be classified as depending on object detection, density estimation, point-supervision, and weak-supervision. Deep learning-based methods can also be divided into CNN-based and transformer-based methods. In an earlier study, researchers used object detection to solve the crowd counting problem [1,2]. However, such methods do not work for dense scenes: severe occlusion and complex backgrounds typically occur in such cases, leading to unsatisfactory results. To solve these problems, some regression-based approaches have appeared. They usually learn low-level features (e.g., texture and edge features) using traditional algorithms and map the features to the number of persons in the crowd through regression models. However, these methods ignore crowd distribution information in the image. To make use of it, Lempitsky and Zisserman [3] proposed a method based on density estimation which learns a linear or nonlinear mapping between image features and density maps. Nonetheless, the features extracted by traditional methods cannot capture deep-level feature representations. Therefore, Walach and Wolf [4] and others [5,6] have used CNN-based approaches to regress density maps. The powerful image feature extraction capability of CNNs enables these methods to achieve better results. Nowadays, CNN-based methods have become mainstream for dense scenes.
Recent CNN-based fully-supervised methods [7][8][9] achieve excellent results; they require both a count and annotation of individual people as supervision. These methods generate the true density map from individual annotations and regress the predicted density map. Nevertheless, detailed individual labeling is tedious, limiting its application. Therefore, it is important to find a method that can obtain precise results using only crowd counts as annotations. Corresponding deep learning-based weakly-supervised methods have thus emerged [10,11]. However, these existing weakly-supervised methods usually ignore the extraction of global receptive fields and multi-level information; because they predict the total count directly from the entire image, a global receptive field is particularly important for them. Due to its local feature extraction, a CNN is ill-suited to capturing a global receptive field without the intermediary of a density map. In 2021, a transformer was introduced to the weakly-supervised crowd counting task [12]. The global attention of the corresponding network can effectively overcome the limited receptive field of CNN-based methods. However, this work cannot effectively extract multi-level information about the target. Figure 2 shows an image with targets of different sizes in the two regions marked in red. Thus, for weakly-supervised crowd counting, multi-level information is very important: sufficient features cannot be learned to regress the count if multi-level information is not properly utilized.
This paper proposes DTCC, a pyramid vision transformer network for weakly-supervised crowd counting. It comprises a transformer feature extraction module and a multi-level dilated convolution regression module. The main contributions of this paper are: (1) DTCC, a multi-level transformer dilated convolution weakly-supervised framework, which is capable of accurate end-to-end crowd counting. (2) A multi-level crowd information feature extraction module for dense prediction. The final feature representation can distinguish between dense crowd heads and larger scale crowd heads. The overall framework is a recursive pyramid structure, which combines a pyramid vision transformer backbone network and a fine-tuned recursive pyramid structure (recursive fine-FPN) to obtain multi-level contextual crowd information. (3) A multi-level dilated convolution regression module, which can enhance the receptive field for features and capture stronger global features. It is instantiated as two networks, DTCC-Dynamic and DTCC-Mean, with multi-level regression heads adapted to different crowd scenes. (4) Experiments on four well-known benchmark datasets demonstrating better accuracy than other weakly-supervised methods, with results competitive with mainstream fully-supervised methods.

Background
Crowd counting approaches can be divided into two categories: fully-supervised and weakly-supervised methods. Fully-supervised methods use a density map as supervisory information to train the model, which requires point-level annotations of the crowd. Weakly-supervised methods only need a count of the crowd. Mainstream crowd counting methods usually utilize CNNs to regress a density map [23,26,44]. The success of transformer-based methods in computer vision tasks such as image classification [13][14][15], object detection [13,16,17], and image segmentation suggests use of a transformer framework as the backbone network for crowd counting.

Fully-supervised crowd counting
CNN-based methods regress a density map and obtain the total number of people in the crowd by integrating the density map. Zhang et al. [18] proposed a network with three differently-sized receptive fields, which was capable of learning multi-level crowd features. This method replaces the fully connected layer by a convolution layer and can accommodate input images of varying size. Sam et al. [19] proposed a selective CNN with several convolution kernels of different sizes as the density map regression head; a selection classifier selects the optimal regression head for the input to predict the result. Li et al. [20] presented a deeper framework with convolution layers as the backbone, based on a combination of VGG16 and dilated convolution layers to expand the receptive field. It extracted deeper features without losing resolution. Later advances considered new density map loss functions with better results: Ma et al. [21] proposed a point-supervised loss function for crowd count estimation, converting a sparse point labeling into a ground truth density map using a Gaussian kernel. This was used as a learning target to train the density map estimator. Liu et al. [22] used a swin transformer as the backbone network and a top-down fusion mechanism to fully utilize the various spatial information extracted from different stages of the model. Abousamra et al. [24] reported a method that uses topological constraints instead of binary region maps to compute L2 loss functions for the head and background regions. Only a few methods use transformer networks to realize fully-supervised crowd counting. Among them, Liang et al. [23] proposed an elegant, end-to-end crowd localization transformer that solves the task using a regression-based paradigm. Sun et al. [25] investigated the role of global contextual information in crowd counting.
This method extracts global information from overlapping image blocks using a transformer, and adds contextual tags to the input sequence. In addition, a token attention module and regression token module are proposed to predict the total number of people in images. Gao et al. [26] presented a dilated convolutional transformer method, introducing a window-based vision transformer for crowd counting.
In summary, fully-supervised crowd counting methods have been extensively studied and have achieved good results. However, the application of fully-supervised methods to specific scenes is very limited, because they require individual annotation to generate density maps, and it is tedious and difficult to perform accurate individual annotation for dense scenes.

Weakly-supervised crowd counting
Weakly-supervised counting methods just rely on crowd counts for training. Shang et al. [27] proposed an end-to-end CNN architecture that exploits shared computation over overlapping regions. Wang et al. [28] presented a novel and efficient counter, which explores embedded global dependency modeling and total count regression by designing a multi-granularity regressor. Lei et al. [29] suggested a new multi-assisted task training strategy, MATT, which learns from a few images with individual annotations and many simply with counts to obtain more accurate predictions. Transformers have an inherent advantage in weakly-supervised crowd counting, since they can enhance global information about features and capture contextual knowledge. TransCrowd [12] was the first transformer-based crowd counting framework, which reformulates the counting problem from a sequential perspective to a counting perspective. CCTrans [31] is applicable to both fully-supervised and weakly-supervised data, and uses Twins [32] as a feature extraction framework. It combines the features of multiple stages of the Twins network through multi-level dilated convolutions for feature fusion, finally predicting the number of people through a regression head. Savner and Kanhangad [52] proposed an architecture based on a pyramid vision transformer network to extract multi-scale features with global context. Wang et al. [53] proposed a joint CNN and transformer network based on weakly-supervised learning to reduce the number of parameters and overcome the problem of target segmentation.
Without annotations of individuals, weakly-supervised crowd counting is challenging. Existing weakly-supervised methods cannot extract sufficient global features and multi-level information, leading to the loss of collective semantic information and a failure to provide rich global features for the final regression. Using a global attention mechanism provides a new way to design an effective weakly-supervised crowd counting model.

Approach
Existing weakly-supervised crowd counting methods have two problems to solve: extraction of a global receptive field and utilization of multi-level information. Therefore, this paper introduces a swin transformer to capture global features. A feature pyramid structure is also introduced to enrich the multi-level feature representation, so that single-level features contain rich multi-level information. In addition, since the window attention mechanism of the swin transformer processes image patches, it alleviates the problem of uneven crowd distribution to a certain extent. To enhance the receptive field of the features, a multi-level dilated convolution module is designed for the swin transformer, mitigating the local information loss caused by dilated convolutions. Based on the above ideas, we propose DTCC, an end-to-end weakly-supervised method for crowd counting, which can provide accurate crowd counts based only on crowd count annotations.

Network architecture of DTCC
The framework of DTCC is shown in Fig. 3. The input image is divided into blocks of the same size and converted into a 1D sequence for the swin transformer. DTCC is composed of two main modules. The recursive swin transformer feature extraction module consists of a swin transformer [13] and the recursive fine-FPN. The multi-level dilated convolution regression head module consists of a multi-level dilated convolution and a linear regression head. The counting results from multi-level feature regression are given different weights to obtain the final count.
For feature extraction, the swin transformer is built from transformer encoders; therefore, the 2D image structure must be converted into the 1D sequence required as input by a transformer encoder. This network is commonly used in natural language processing, but ViT [30] showed that, once this input problem is solved, it can also achieve good results in computer vision.
The input image to the swin transformer is defined as X ∈ R^{H×W×C}, where H, W, and C represent the height and width of the image and the number of channels, respectively. First, the image is divided into patches of size P × P; thus the image of size H × W × C becomes patches X ∈ R^{N×P×P×C}, where N = (H/P) × (W/P). Each image patch is then flattened and linearly transformed into a vector of length P² × C for input to the model. The input image is therefore transformed by preprocessing into Z ∈ R^{N×T}, where T = P² × C. The feature extraction backbone of the swin transformer calculates local window attention, performing the embedding operations within a window for each image patch. It uses two window division operations: the image is first divided into image patches, and then the patches are grouped by shifted windows. Shifting non-overlapping local windows reduces the computational complexity to linear in the image size.
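The patch-flattening preprocessing above can be sketched in NumPy (shapes only; the learnable linear projection is omitted, and `patchify` is an illustrative name):

```python
import numpy as np

def patchify(img, P):
    """Split an H x W x C image into N = (H/P)*(W/P) patches,
    each flattened to a vector of length T = P*P*C."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    x = img.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)       # Z with shape (N, T)

Z = patchify(np.zeros((384, 384, 3)), 4)
print(Z.shape)  # (9216, 48): N = 96*96, T = 4*4*3
```

For a 384 × 384 RGB image and P = 4 this yields N = 9216 tokens of length T = 48, matching the definitions above.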

Recursive swin transformer
The recursive swin transformer (RST) effectively combines the pyramid structure transformer backbone with the recursive fine-FPN. For crowd counting in dense prediction tasks, the swin transformer has a pyramid structure similar to a CNN which can extract multi-level feature representations of images. The self-attention mechanism of transformer solves the disadvantages of local feature extraction in CNNs and can capture stronger global features. In addition, the window attention mechanism of swin transformer is executed on image patches, which alleviates the problem of uneven distribution of the crowd to a certain extent. Recursive fine-FPN iteratively fuses multi-level features to observe multiple views of the image, and produces richer feature representations for the regression head.

Transformer backbone network
The transformer backbone network feeds Z ∈ R^{N×T} into the transformer encoder and uses a multi-headed attention mechanism to extract features (since the visual task does not require the decoding part, only the encoding part is utilized). The transformer encoder consists of multi-head self-attention (MSA) and MLP layers, each preceded by layer normalization (LN) and wrapped in a residual connection. The overall process is given by

Z'_l = MSA(LN(Z_{l-1})) + Z_{l-1}  (1)
Z_l = MLP(LN(Z'_l)) + Z'_l  (2)

where Z'_l is the output of MSA. Self-attention is the most important contribution of the transformer. The attention mechanism can assign different weights to input information when aggregating information: briefly, it learns the attention between a sequence and other sequences, which from an operational point of view is a weight matrix. There are three concepts in attention: the query (Q), the key (K), and the value (V). Each sequence outputs Q, K, and V by multiplication with the W_Q, W_K, and W_V matrices, where K and V exist in pairs. The attention between different sequence pairs for each sub-sequence query is

Attention(Q, K, V) = softmax(QK^T / √d) V  (3)

where d is the dimension of the query and key. The original transformer structure also adds a positional encoding to provide location information. ViT [30] does not use the default fixed positional encoding, and instead sets the positional encoding to a set of learnable 1D sequences. The position encoding used by the swin transformer has two differences: relative position information is used instead of absolute position information, and it is added to the attention matrix as a bias B. The attention can then be written as

Attention(Q, K, V) = softmax(QK^T / √d + B) V  (4)

The attention mechanism in the transformer is multi-head attention, using h heads to compute attention, which allows the model to focus on different aspects of the information. The input sequence Z ∈ R^{N×T} is linearly projected h times to give per-head queries, keys, and values. Finally, this module concatenates the outputs from the h heads and obtains the final output using a linear transformation.
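As a minimal illustration of the window attention described above, here is a single-head NumPy sketch with an additive relative position bias B; the random matrices stand in for the learnable projections W_Q, W_K, W_V:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(Z, Wq, Wk, Wv, B):
    """Z: (N, T) window tokens; B: (N, N) relative position bias
    added to the attention logits before the softmax."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d) + B)   # (N, N) attention weights
    return A @ V                            # weighted aggregation of values

rng = np.random.default_rng(0)
N, T, d = 16, 48, 32
Z = rng.standard_normal((N, T))
Wq, Wk, Wv = (rng.standard_normal((T, d)) for _ in range(3))
out = window_attention(Z, Wq, Wk, Wv, np.zeros((N, N)))
print(out.shape)  # (16, 32)
```

The multi-head version simply runs h such projections in parallel and concatenates the outputs before the final linear layer.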
The swin transformer is a pyramid structure that can handle multiple levels as well as reducing complexity. This network consists of four transformer network layers to calculate local attention, with step sizes for the four stages given by P = [4,8,16,32]. The swin transformer combines two types of window division method which effectively captures both local and global attention.

Recursive fine-FPN
As Fig. 4 shows, we add a recursive feature pyramid structure [46] after the transformer; it is inspired by the idea of looking and thinking twice before acting. This network can deliver better semantic information through feedback connections in the structure of the fine-FPN. Multi-level features are extracted and fed back to the transformer backbone layer to realize bottom-up connections for the corresponding network layer. It is important to note that our method uses a recursive feature pyramid network to fuse the features from different stages, which differs from the winner of ICCV-VisDrone [22]. This architecture can look at images twice or more, so it can better observe detailed information in a dense crowd image.
The fine-FPN solves the multi-level problem in crowd counting tasks through simple network connections. The overall structure is bottom-up, integrating features at different scales. Each layer of fine-FPN first adjusts the number of channels using a 1 × 1 Conv and up-samples the features from the previous stage. Next, it performs fusion with a simple add operation; finally, a 3 × 3 Conv is used to eliminate blending effects. fine-FPN improves upon the original FPN: we up-sample the fused features after the 3 × 3 Conv to give the higher-level fused features, improving robustness. A two-level recursive feature pyramid is used in this paper, defined as

M'' = fine-FPN(SwinT(M'))  (5)

where M' is the output of the first-stage fine-FPN and M'' is the final output. We combine them in the same way as we combine the fine-FPN output with the output of the swin transformer.
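A shape-level sketch of one fine-FPN fusion pass, under our reading of the description above; the 1 × 1 and 3 × 3 convolutions are replaced by identity stand-ins and nearest-neighbour upsampling is used, so all names here are illustrative:

```python
import numpy as np

def upsample2x(f):
    # nearest-neighbour upsampling of a (C, H, W) feature map
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fine_fpn(feats, lateral, smooth):
    """feats: multi-level features, finest resolution first.
    lateral / smooth: per-level stand-ins for the 1x1 and 3x3 convs."""
    fused, prev = [], None
    for f, lat, sm in zip(reversed(feats), lateral, smooth):
        x = lat(f)                          # 1x1 conv: adjust channels
        if prev is not None:
            x = x + upsample2x(prev)        # add upsampled deeper features
        prev = sm(x)                        # 3x3 conv applied after fusion
        fused.append(prev)
    return fused[::-1]                      # finest resolution first again

ident = lambda f: f
feats = [np.ones((8, 16, 16)), np.ones((8, 8, 8)), np.ones((8, 4, 4))]
out = fine_fpn(feats, [ident] * 3, [ident] * 3)
print([f.shape for f in out])  # [(8, 16, 16), (8, 8, 8), (8, 4, 4)]
```

The recursive variant would feed these fused maps back into the backbone stages and run the fusion a second time.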

Multi-level dilated convolution regression head
The density of people in images varies greatly in the crowd counting task, and images contain objects at different scales. Therefore, extraction of global features is an important foundation for weakly-supervised crowd counting. We use a multi-level dilated convolution regression head to enhance the receptive field of features. As Fig. 3(b) shows, the multi-level dilated convolution regression head (M-DRH) module consists of multi-level dilated convolution and multi-headed linear regression layers. Dilated convolution is commonly used in computer vision to collect contextual information without adding extra parameters, while widening the receptive field. The dilation rate represents the interval used in the convolution kernel. When the rate is equal to 1, the result is the same as ordinary convolution.
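To illustrate how the dilation rate enlarges the receptive field without adding parameters, here is a naive valid-padding 2D dilated convolution (purely illustrative; real implementations use optimized library kernels):

```python
import numpy as np

def dilated_conv2d(x, k, rate):
    """Naive valid-padding 2D convolution with dilation `rate`.
    A 3x3 kernel with rate r spans (2r + 1) x (2r + 1) input pixels."""
    kh, kw = k.shape
    sh, sw = (kh - 1) * rate, (kw - 1) * rate   # dilated span minus 1
    H, W = x.shape
    out = np.zeros((H - sh, W - sw))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + sh + 1:rate, j:j + sw + 1:rate] * k).sum()
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3))
print(dilated_conv2d(x, k, rate=1).shape)  # (3, 3): ordinary convolution
print(dilated_conv2d(x, k, rate=2).shape)  # (1, 1): 3x3 kernel covers 5x5
```

With rate = 1 this reduces to ordinary convolution, as stated above; larger rates sample the input at wider intervals with the same nine weights.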
Using the multi-level features output by the RST, the M-DRH module performs multi-level dilated convolution on the various features. The dilation rate is inversely proportional to the feature level: [2,3,4]. At the same time, the M-DRH module avoids the problem of local information loss resulting from dilated convolution. The swin transformer has four stages that perform down-sampling to extract multi-level features, as in CNN-based crowd counting methods. The down-sampling rate of each stage is 2, so elements are selected at row- and column-wise intervals of 2. See Fig. 5: K' is the output of the stage following K; green shows where convolutional operations are performed, while white means that no convolution is performed. Down-sampling in the swin transformer merges features from four neighbouring points, so K'_12 is composed of K_(1−2)(3−4) and K'_13 is composed of K_(1−2)(5−6). We perform dilated convolution with dilation rates of 2 and 3 on K and K', respectively. Although K'_12 is skipped by the dilated convolution, K_(1−2)(3−4) is covered by it; likewise, although the K_(1−2)(5−6) pixel block is skipped, K'_13 is covered, and so on for the other parts. This mechanism ensures that the proposed M-DRH avoids the local feature loss caused by dilated convolution.
Since crowd images contain head information at multiple scales, different levels of crowd head feature maps have different advantages. Therefore, we use multi-level feature maps to regress results and a dynamic layer to learn optimal fusion parameters. Specifically, an activation function and linear layer overlay component are designed to perform regression on multi-level features simultaneously. We use two kinds of fusion mechanism. In Fig. 6(left), we add a dynamic layer of parameters to learn fusion weights for the three regression results; this layer contains three learnable parameters. In Fig. 6(right), we directly average the three regression results to get the final result.
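The two fusion mechanisms can be sketched as follows; whether DTCC normalizes its three learnable weights (here via a softmax) is an assumption of this sketch:

```python
import numpy as np

def dynamic_fuse(preds, logits):
    """DTCC-Dynamic style fusion: weight the per-level counts by
    three learnable parameters (normalized here with a softmax)."""
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()
    return float(np.dot(w, preds))

def mean_fuse(preds):
    """DTCC-Mean style fusion: average the per-level counts."""
    return float(np.mean(preds))

preds = [118.0, 124.0, 121.0]   # counts from the three regression heads
print(mean_fuse(preds))          # 121.0
print(dynamic_fuse(preds, np.zeros(3)))  # equal logits reduce to the mean
```

With equal logits the dynamic head reduces to the mean head; training shifts the logits to favour the feature level that predicts best for the dataset at hand.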

Loss function
The number of people in dense scenes can be relatively large. However, the L1 loss function commonly used in related studies has a non-differentiable point, which can lead to instability in the case of large counts. In this paper, the SmoothL1 [33] loss function is used to ensure smooth output and enhance robustness; it is less likely to cause gradient explosion. It is given by

SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise  (6)

where x = p − D represents the difference between the predicted result p and the ground truth D.
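A minimal sketch of this loss, using the standard threshold of 1:

```python
def smooth_l1(p, D):
    """SmoothL1 loss on x = p - D: quadratic near zero, linear for
    large errors, so gradients stay bounded for large counts."""
    x = abs(p - D)
    return 0.5 * x * x if x < 1.0 else x - 0.5

print(smooth_l1(100.5, 100.0))  # 0.125 (small error: quadratic regime)
print(smooth_l1(103.0, 100.0))  # 2.5   (large error: linear regime)
```

In the linear regime the gradient magnitude is constant (1), which is what keeps training stable when the predicted count is far from the ground truth.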
The feature extraction backbone network outputs multi-level feature representations, which addresses the problem of target scale change in images. Therefore, two regression head fusion mechanisms are used in this paper. DTCC-Dynamic uses a dynamic network layer to automatically learn different fusion weights, with loss function

L_Dynamic = SmoothL1(a·e₁ + b·e₂ + c·e₃ − D)  (7)

where the multi-level regression values are e₁, e₂, e₃, and the outputs of the dynamic layer are a, b, c. DTCC-Mean takes the mean of the predicted values, so the loss function of DTCC-Mean is

L_Mean = SmoothL1((e₁ + e₂ + e₃)/3 − D)  (8)

Overview
In this section, we evaluate DTCC using several public crowd counting datasets: ShanghaiTech, UCF CC 50, UCF QNRF, and JHU-Crowd++. We compare our results to those of both weakly-supervised and fully-supervised methods in Tables 1-3. In addition, results of ablation experiments conducted to evaluate each component of the proposed framework are shown in Tables 4-6.

Datasets
We used the following datasets: (1) UCF CC 50 [34] consists of 50 images in total, divided into training and validation sets in a ratio of 4:1. The dataset contains a small number of images and high density variation, with a maximum of 4633 people and a minimum of 96 people, with an average count of 1297.
(2) ShanghaiTechA [35] consists of 482 images in total, with 300 training images and 182 validation images. The images were randomly crawled from the Internet, so the images have a very wide range of sources. The images contain an average of 501 and a range of 33-3139 people. (3) ShanghaiTechB [35] consists of 716 images in total, with 400 training images and 316 validation images. These are real images from the streets of Shanghai, captured by road cameras. The images contain an average of 123 and a range of 9-578 people. (4) JHU-Crowd++ [36] consists of 4822 images in total, with 2722 training images, a validation set of 500 images, and a test set of 1600 images. This dataset has rich image information including count, person center coordinates, head frame coordinates, weather information, and lighting conditions. It can also be divided into three datasets according to the number of people contained: JHU-Low, JHU-Medium, and JHU-High. The images contain an average of 437 and a range of 2-7286 people. (5) UCF QNRF [51] consists of 1535 images in total, with 1201 training images and a validation set of 334 images. It contains real scenes from around the world, including buildings, vegetation, sky, and roads, which are important for counting crowds in different situations. The images contain an average of 815 and a range of 49-12,865 people.

Baselines and compared methods
In order to verify the effectiveness of our method, we chose a large number of comparison methods, including mainstream fully-supervised and state-of-the-art weakly-supervised methods. Fully-supervised methods need both person location and count annotations, and include CAN [37], ADSCNet [38], PACNN [39], S-DCNet [40], and P2PNet [41]. Weakly-supervised methods only need count annotations, and include MATT, TransCrowd, and CCTrans. In particular, TransCrowd and CCTrans also use a transformer as the backbone network for feature extraction.

Implementation details
We used the Swin-L model pre-trained on ImageNet-22K to speed up convergence of the model. For the swin transformer backbone network, the number of heads was [6, 12, 24, 48], the numbers of layers were [2, 2, 18, 2], the position embedding was a position bias matrix, the window size was 12, and the number of channels in the hidden layer of the first stage was 192. During training, we strictly followed the 384 × 384 input image size requirement of Swin-L. We used the same approach as TransCrowd [12]: we resized all original images to 1152 × 768 (landscape) or 768 × 1152 (portrait), then cropped each image into 6 blocks of size 384 × 384, and calculated the number of people in each image block from the location annotations of the people in the image. We also utilized data augmentation strategies, such as random flipping and gray scaling. For compatibility with the operation of dividing each image into 6 image blocks, the training batch size was set to 24. All experiments were executed on a Linux system with an Intel Xeon E5-2620 v4 CPU at 2.10 GHz and an NVIDIA Tesla P100 (16 GB) GPU. The learning rate was set to 10^−5 initially and decreased to 10^−6 in the final epoch.
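The resize-and-crop protocol above can be sketched as follows (box coordinates only; `six_crop_boxes` is an illustrative name, and the actual resizing and per-block count computation are omitted):

```python
def six_crop_boxes(w, h, size=384):
    """Resize target and 3x2 (or 2x3) tiling used to cut each image
    into six size x size blocks, following the TransCrowd protocol."""
    W, H = (1152, 768) if w >= h else (768, 1152)   # landscape / portrait
    boxes = [(x, y, x + size, y + size)
             for y in range(0, H, size)
             for x in range(0, W, size)]
    return (W, H), boxes

(W, H), boxes = six_crop_boxes(1920, 1080)
print(len(boxes))  # 6
print(boxes[0])    # (0, 0, 384, 384)
```

Each of the six blocks then receives its own count label, which is why the batch size is chosen as a multiple of 6.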
In the evaluation phase, we chose the widely accepted MAE and MSE as metrics:

MAE = (1/N) Σᵢ |Pᵢ − Gᵢ|  (9)
MSE = √((1/N) Σᵢ (Pᵢ − Gᵢ)²)  (10)

where N is the number of images, and Pᵢ and Gᵢ represent the i-th predicted count and ground truth, respectively. The MAE, or mean absolute error, is a very intuitive evaluation metric representing the distance between the predicted value and the ground truth; the MSE better represents the stability of the model.
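These metrics can be computed as below; note that, following crowd counting convention, the reported "MSE" is the root of the mean squared error:

```python
import numpy as np

def mae_mse(pred, gt):
    """MAE and MSE as used in crowd counting evaluation."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())   # root of mean squared error
    return mae, mse

mae, mse = mae_mse([100, 210], [100, 212])
print(mae)  # 1.0
print(mse)  # ~1.414: the squaring penalizes the one large error more
```

Because of the squaring, MSE is dominated by the worst predictions, which is why it is read as a stability indicator.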

Comparison to existing methods
Our method, in its DTCC-Dynamic and DTCC-Mean variants, shows good accuracy compared to other weakly-supervised methods. Table 1 gives errors for the UCF CC 50, ShanghaiTech, and UCF QNRF datasets. UCF CC 50 has only a few images, all of dense crowds. Without increasing the amount of data, our method makes significant progress compared to other weakly-supervised methods. DTCC-Mean achieves better results than DTCC-Dynamic, showing that an average fusion mechanism for the regression head can provide good accuracy and robustness given a small amount of data. The ShanghaiTech partA dataset comes from a wide range of scenes with large variations in crowd density, so accurately estimating the number of people is very challenging. Our proposed method DTCC-Dynamic achieves the best MAE. This indicates that our swin transformer backbone network can better adapt to different densities, while the dynamic fusion regression head can learn optimal ratios from a large amount of data. However, the MSE metric is less satisfactory, so in the presence of anomalies our method still needs to be improved. On ShanghaiTech partB, DTCC again achieves significant improvements over other weakly-supervised methods.
For the JHU-Crowd++ dataset, we conducted experiments on the validation set (Table 2) and test set (Table 3) separately. We divided the dataset into three count levels, low (0-50), medium (51-500), and high (500+), and also aggregated total results. On the validation set, DTCC achieves better results than other weakly-supervised methods, and shows competitive results when compared to mainstream fully-supervised methods. To further demonstrate the effectiveness of the proposed DTCC, we conducted further experiments on the test set using the pre-trained model parameters of JHU-Total. Here, on the JHU-Low subset, our proposed method achieves competitive results, differing only slightly from the state-of-the-art weakly-supervised method. On JHU-Medium, our method achieves the best estimates compared to other weakly-supervised methods, showing a strong advantage for this higher-density subset. On JHU-High, further improvements are seen: the proposed dynamic fusion mechanism achieves better results, as its dynamically learned parameters provide good fault tolerance for ultra-dense scenes. JHU-Total contains all the images, and the density range of the dataset is large, which requires a robust model. The good improvements to MAE and MSE show that our method has not only high accuracy but also good stability.

Visualization of feature maps
To verify the effectiveness of our method, we visualized feature maps on ShanghaiTech PartA using heat maps. As noted, each image is split into six sub-images to input into the model, which can be seen in the visualization. Figure 7 shows that our method pays more attention to dense image regions and adapts to different scenes. It can also be seen that there are a few areas incorrectly given attention due to the lack of individual person annotations.

Computational cost
To evaluate the all-round performance of DTCC, we calculated the number of parameters and GFlops consumed by the model. Table 4 compares the results with other mainstream fully-supervised and weakly-supervised crowd counting methods. DTCC consumes more computing resources than other methods, because it uses the swin transformer as its backbone network; using a recursive swin transformer further increases the cost. However, the cost of our work is still of the same order of magnitude as that of other works, with no major impact on practical applications.

Ablation experiments and tuning
We conducted various ablation experiments on the DTCC-Dynamic version using the ShanghaiTech dataset to verify the contribution of each module, justify the reasoning behind it, and tune its operation. Table 5 shows results of an ablation experiment on the use of the multi-level dilated convolution regression module. By comparing the results with and without M-DRH, we can see that introducing the multi-level dilated convolutional regression head improves counting accuracy. This justifies our assumption that multi-level feature relationship modeling can capture different scales of crowd information in images with dense crowd scenes. In addition, the presence of the dilated convolution can enhance the global receptive field, which is important for weakly-supervised crowd counting methods.

Multi-level dilated convolution regression head
We conducted further experiments to assess different choices of dilation rate. As Table 6 shows, optimal results were obtained by setting the dilation rates to [2,3,4]: too small a dilation rate does not sufficiently enhance the receptive field of the features, while too large a rate may lead to loss of local features, so we adopt [2,3,4] as a compromise.
We also performed experiments with different multi-level features for weakly-supervised methods, as reported in Table 7. Levels 1, 2, and 3 denote feature maps with resolutions of 12 × 12, 24 × 24, and 48 × 48, respectively. Adding successive feature maps of different resolutions improves the model's results significantly, demonstrating that multi-level features are important to our method.

Recursive fine-FPN
Table 8 uses DTCC-Dynamic as a baseline, and compares results using the baseline with recursive FPN against the baseline with recursive fine-FPN. Using fine-FPN achieves better results than the original FPN, indicating that, for crowd counting, fusing deep features up-sampled after the 3 × 3 Conv provides better performance. We performed further experiments on the pyramid structure, taking the DTCC-Dynamic method without any pyramid structure as the baseline. Table 9 compares results of three sets of experiments: the baseline, the baseline with fine-FPN, and the baseline with recursive fine-FPN. Adding the pyramid structure effectively improves the accuracy of the model, and the recursive pyramid structure achieves the best accuracy. For the last-level features of the transformer output, pixel information is easily lost as the patch stride increases. The recursive fine-FPN incorporates extra feedback connections from the fine-FPN into the bottom-up backbone layers, allowing feature maps at all levels to carry strong contextual information.

Loss function
We separately evaluated the commonly used L1 and SmoothL1 [33] loss functions. From the results in Table 10, it can be concluded that SmoothL1 gives better results. SmoothL1 is more stable and can adapt well to both large and small errors.

Conclusions
This work proposes a pyramidal vision transformer network for weakly-supervised crowd counting; it can achieve end-to-end crowd counting. A multi-level feature extraction module and a multi-level dilated convolutional regression module are designed for dense prediction tasks; they can better capture global features and generate more reasonable features for weakly-supervised crowd counting. Extensive experiments on four well-known benchmark datasets demonstrate that DTCC achieves superior counting performance compared to other mainstream weakly-supervised methods and is competitive with some fully-supervised methods.
In future, we plan to further investigate a more concise feature extraction backbone network for crowd counting, and design a better regression head for prediction. In addition, we also intend to further extend DTCC to other dense prediction scenarios, such as traffic counting for intelligent transportation.