MSD-NAS: multi-scale dense neural architecture search for real-time pedestrian lane detection

Accurate detection of pedestrian lanes is crucial for vision-impaired people to navigate freely and safely. The current deep learning methods have achieved reasonable accuracy at this task. However, they lack practicality for real-time pedestrian lane detection due to a suboptimal trade-off between accuracy, speed, and model size. Hence, an optimized deep neural network (DNN) for pedestrian lane detection is required. Designing a DNN from scratch is a laborious task that requires significant experience and time. This paper proposes a novel neural architecture search (NAS) algorithm, named MSD-NAS, to automate this laborious task. The proposed method designs an optimized deep network with multi-scale input branches, allowing the derived network to utilize local and global contexts for predictions. The search is also performed in a large and generic space that includes many existing hand-designed network architectures as candidates. To further boost performance, we propose a Short-term Visual Memory mechanism to improve information facilitation within the derived networks. Evaluated on the PLVP3 dataset of 10,000 images, the DNN designed by MSD-NAS achieves state-of-the-art accuracy (0.9781) and mIoU (0.9542), while being 20.16 times faster and 2.56 times smaller than the current best deep learning model.


Introduction
Visual impairment is a condition that can affect the quality of life significantly. In 2015, around 36 million people globally suffered from blindness. This figure is estimated to reach 115 million by the year 2050 [1]. For vision-impaired people, accurately detecting the pedestrian lane is essential to navigate freely and safely. Currently, this task is performed using manual aids that are prone to errors, such as white canes and guide dogs [16]. Hence, there is a need for automatic pedestrian lane detection methods that are robust, accurate, and fast.
The early methods for automatic pedestrian lane detection use traditional image processing techniques. These methods are generally unsuitable for real-time systems because they are slow and ineffective. Several methods relied on white markers surrounding the pedestrian lanes, e.g., [20,40]. This approach is ineffective because most pedestrian lanes are unmarked, and have arbitrary shapes and surfaces. Some methods relied on manually-extracted features or vanishing point estimation, e.g., [15,30,34]. These approaches are not robust because they are sensitive to scene variations.
The more recent automatic pedestrian lane detection methods use deep learning (DL) techniques, where the pedestrian lane detection task is cast as a two-class semantic segmentation problem (0: background, 1: pedestrian lane). A recent survey of pedestrian lane detection methods demonstrates that the DL-based segmentation methods can detect pedestrian lanes accurately [21]. However, they still lack practicality due to a suboptimal trade-off between accuracy, speed, and model size. Based on the survey, the most accurate DL method is a large DNN, named Multiscale HRNet [27,38], and the fastest DL method is a DNN automatically designed by a neural architecture search method, called Fast-NAS [4].
The Multiscale HRNet is accurate (mean intersection over union of 0.9568), but it has a slow processing time (2.6 frames per second). In contrast, the DNN derived by Fast-NAS is fast (142.86 frames per second), but it is slightly less accurate (mean intersection over union of 0.9229). Hence, a suitable DNN for real-time pedestrian lane detection is still needed.
Designing a suitable DNN manually for pedestrian lane detection is challenging. There are many design considerations, including the number of layers, the connections between layers, and the operation at every layer. A poorly designed network will lead to low performance. Additionally, the DNN has to be accurate, fast, and compact for practicality. To this end, NAS is a promising approach because it can find a suitable network automatically based on given criteria, thus reducing architecture engineering effort significantly.
This paper proposes a new NAS algorithm, named Multiscale Dense NAS (MSD-NAS), that can automatically design a fast, accurate, and compact DNN for the pedestrian lane detection task. The search space of MSD-NAS is large and dense, and it contains many existing hand-crafted DNNs as candidates. Additionally, the network designed by MSD-NAS supports multi-scale inputs, allowing it to utilize both local and global contexts for predictions. To further improve the detection performance, we introduce a novel Short-term Visual Memory mechanism to improve information facilitation in the derived network.
The contributions of this paper can be highlighted as follows:
1. We propose a new neural architecture search algorithm, called MSD-NAS, to automatically find the optimum DNN with multi-scale input branches for pedestrian lane segmentation. The capability of MSD-NAS is demonstrated via extensive analysis and experiments.
2. We introduce a novel NAS search space that is generic and large. The search space is represented as a Generalized Segmentation Network (GSN). GSN has multi-scale input branches, allowing the search algorithm to select the best input scale. In fact, many state-of-the-art hand-crafted DNNs for image segmentation are special cases of the GSN.
3. We propose a new Short-term Visual Memory (STVM) mechanism for the derived network of MSD-NAS. It helps information-sharing within the derived network. Our experiments show that the STVM mechanism further improves the segmentation accuracy.
The remainder of this paper is organized as follows. Section 2 reviews the related work, Section 3 describes the proposed neural architecture search method, Section 4 presents the experimental results and analysis, and Section 5 concludes our work.

Related work
This section first reviews the existing unmarked lane detection algorithms (Section 2.1). To justify our NAS search space, we also discuss the state-of-the-art DL-based image segmentation methods (Section 2.2).

Unmarked lane detection
Several methods have been proposed for detecting unmarked lanes. In general, there are three main approaches: traditional-based lane segmentation, lane-border detection, and deep learning-based segmentation. Traditional-based lane segmentation involves using color models that have been pretrained to classify each pixel as either the lane class or the background class [31,35,37]. The methods in this approach vary based on the color space and classifier employed. For instance, Tan et al. [37] used color histograms in the RGB space, while Sotelo et al. [35] classified pixels using the hue-saturation-intensity space. Ramstrom and Christensen [31] used feature maps from both RGB and YUV color spaces to build Gaussian mixture models for classification. As these methods are trained offline, they have limitations when dealing with variations in lane surface, such as changes in color, texture, and shape.
To overcome the limitations of the previous methods, some methods directly model the lane pixels by selecting sample regions from the input image [3,26,29]. These methods vary in the way they choose the sample regions. For example, Miksik et al. [26] initialized the sample lane region as a trapezoid centered at the bottom of the image and then refined the region using the vanishing point. In contrast, Alvarez and Lopez [3] randomly selected small areas at the bottom of the input image, assuming that these areas are the road surface. However, these methods are sensitive to the quality of the sample regions and may require domain expertise.
Lane-border detection identifies the lane boundaries by utilizing the vanishing point [19,32] or templates of the lane boundaries [12]. Kong et al. [19] detected the lane borders by examining the edges pointing to the vanishing point, while Crisman and Thorpe [12] identified the lane boundaries from the edges of homogeneous color regions by matching the lane templates. These lane-border detection methods can be sensitive to background edges and the accuracy of vanishing point estimation. To tackle this issue, Chang et al. [6] combined the lane-border detection from the vanishing point and the lane segmentation by the color model approaches. Another method, proposed by Phung et al. [30], used the vanishing point to construct the sample lane region. The final lane region was determined by using a color model trained from the sample lane region, and the matching scores between the edges of the homogeneous color regions and the lane templates. Both the traditional-based lane segmentation and lane-border detection approaches rely solely on image processing techniques, which are slow and not robust.
Deep learning-based segmentation detects the pedestrian lane from a scene by performing pixel-wise classification, i.e., semantic segmentation, using a deep neural network. Nguyen et al. [28] combined a Gaussian process classifier with a DNN. Thanh et al. [39] proposed a Gabor DNN, which uses variational Bayesian inference for semantic segmentation of pedestrian lanes. Both approaches perform uncertainty estimation to improve the reliability of the segmentation results; however, this comes at the cost of additional processing time. An alternative method, proposed by Ang et al. [4], used neural architecture search to find a faster DNN for pedestrian lane segmentation. Still, its search space does not cover the most advanced segmentation networks, which are discussed in the next section.

Deep learning-based semantic segmentation
The key difference among the different DL-based segmentation methods is the network architectural design. Many attributes of the existing architectures overlap; hence, we group them based on their key novelty.
Fully convolutional network. Most of the current deep learning methods for image segmentation adopt a fully convolutional network (FCN) because of its efficiency [2,36]. With a single forward pass, an FCN can generate an output segmentation map of the same size as the input image. The idea of using an FCN for image segmentation was first introduced by Long et al. [25]. They converted a standard convolutional neural network (CNN) into an FCN by replacing all fully-connected layers with convolutional layers.
Encoder-decoder framework. The recent FCN models follow a more systematic structure: the encoder-decoder framework. The encoder extracts salient features from the input images, and the decoder generates the output segmentation maps from the extracted features. For the encoder, many authors adopted a top-performing convolutional neural network [7,22]. For the decoder, some authors designed their own decoder network [11], while other authors used a mirrored design of their encoder network [5].
Skip-connection. In the encoder-decoder framework, the feature maps transform from high resolution to low resolution (encoding stage), and then from low resolution back to high resolution (decoding stage). Hence, fine-grained information may be lost while encoding and fail to be recovered while decoding. To overcome this problem, Ronneberger et al. [33] used skip-connections to transfer the high-resolution feature maps from the encoder to the decoder. Additionally, Long et al. [25] found that combining the feature maps of the earlier layers and the penultimate layer via skip-connections can improve segmentation performance.
Multi-scale processing. Both local and global contextual information is useful for accurate segmentation outputs. This can be exploited by processing feature maps at different scales. Lin et al. [23] proposed the feature pyramid network (FPN) for object detection and later extended it to image segmentation. Since different depths in the decoder process feature maps of different scales, FPN exploits this pyramidal characteristic by making predictions at every depth. Zhao et al. [43] developed the Pyramid Scene Parsing Network (PSPNet), which contains a pyramid pooling module that uses pooling operations of different sizes to downsample the input feature map into different scales. Wang et al. [41] proposed HRNet, which has multiple resolution streams that exchange feature maps in parallel.
Dilated convolution. Dilated convolution (also known as atrous convolution) manipulates the receptive field by using a sparse kernel. For example, a 3 × 3 convolution with a dilation rate of 2 can be visualized as a 5 × 5 convolution kernel with every second row and column emptied. Hence, the receptive field can be increased without incurring additional parameters.
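To make this arithmetic concrete, the following small sketch (our own helper functions, not from the paper) computes the effective kernel extent of a dilated convolution and the receptive field of a stack of stride-1 dilated convolutions:

```python
def effective_kernel_size(k: int, dilation: int) -> int:
    """Effective (dense) extent of a dilated convolution kernel:
    a k x k kernel with dilation d spans k + (k - 1) * (d - 1) pixels per axis."""
    return k + (k - 1) * (dilation - 1)

def receptive_field(kernel_sizes, dilations) -> int:
    """Receptive field of a stack of stride-1 dilated convolutions:
    each layer adds (effective extent - 1) pixels."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += effective_kernel_size(k, d) - 1
    return rf

# A 3 x 3 convolution with dilation rate 2 behaves like a 5 x 5 kernel,
# exactly as described in the text, with no extra parameters.
assert effective_kernel_size(3, 2) == 5
# Stacking 3 x 3 convs with dilations 1, 2, 4 grows the receptive field to 15.
assert receptive_field([3, 3, 3], [1, 2, 4]) == 15
```

This shows why dilated convolutions are attractive for segmentation: the field of view grows without the parameter cost of a genuinely larger kernel.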
A popular DL-based segmentation model that uses dilated convolution is the DeepLab family, namely DeepLabv1 [8], DeepLabv2 [9], DeepLabv3 [10], and DeepLabv3+ [11]. DeepLabv1 replaces the last few convolutional layers in an FCN with atrous convolutional layers to maintain higher-resolution feature maps. Subsequently, DeepLabv2 introduces the atrous spatial pyramid pooling (ASPP) module. The ASPP module uses several atrous convolutional layers with different dilation rates in parallel, effectively processing the feature maps at multiple scales. DeepLabv3 comprises modules with atrous convolution that are placed in a cascade pattern. The latest version, DeepLabv3+, adopts the encoder-decoder framework.
As discussed above, there are many network architectures with different attributes, each with its own strengths. Inspired by these studies, we incorporate these design elements into our NAS search space. The network candidates in our search space are fully convolutional, have access to high-resolution feature maps, can utilize multi-contextual information, and can perform dilated convolution.

Methodology
We introduce a new neural architecture search algorithm to design the best DNN for pedestrian lane segmentation. The proposed method, named MSD-NAS, can design a network architecture with multiple input branches. Therefore, the derived network can utilize multi-scale information effectively. MSD-NAS finds the optimum architecture from a dense search space, called the Generalized Segmentation Network. Additionally, we propose a novel Short-term Visual Memory mechanism to better facilitate information sharing within the derived network.
This section is organized as follows. Section 3.1 introduces the Generalized Segmentation Network. Section 3.2 describes the architectural parameters of the GSN. Section 3.3 explains the algorithm to optimize the GSN, and Section 3.4 shows the procedure to derive the optimum network from the optimized GSN. Lastly, Section 3.5 presents the Short-term Visual Memory mechanism.

Generalized segmentation network
The Generalized Segmentation Network is a large DNN that is represented by a group of nodes and edges. Each node performs an operation (e.g., a 3 × 3 convolution operation or an identity operation), and each edge represents the information flow between two nodes. The nodes are organized into a two-dimensional grid, where the horizontal axis represents the processing layer, and the vertical axis represents the scale; see Fig. 1.
The GSN has an input layer, an output layer, and L processing layers. Processing is done at multiple image scales: 0, 1, ..., S. At image scale s, the input layer downsamples the input image by a factor of 2^s before passing it to the processing nodes in Layer 1. Each processing node at Scales 1 to (S − 1) produces feature maps of three different scales, while the nodes at Scales 0 and S produce feature maps of two different scales. Let H × W be the size of the input image. A feature map intended for the receiving node at scale z has a spatial size of H/2^z × W/2^z pixels and a channel size of C × 2^z, where C is a user-defined hyperparameter.
A processing node (l, s) at layer l and scale s can receive feature maps from nodes at layer (l − 1) and three adjacent scales (s − 1, s, s + 1). This is in sharp contrast with many existing networks, where each node receives inputs from only one adjacent scale (except for the ad-hoc skip connections). At the output layer, a convolution along the third dimension of the feature map (i.e., a 1 × 1 convolution) is performed to predict a segmentation map of size H/2^s × W/2^s pixels for scale s. To generate the final output, we convert the segmentation map at a selected scale to the same size as the input image via bilinear interpolation.
Fig. 1 The proposed Generalized Segmentation Network for image segmentation
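The grid bookkeeping described above can be sketched with two small helpers (names and layout are ours, a minimal illustration rather than the authors' code): one computes the feature map shape at a given scale, and the other lists the adjacent scales a node receives from, clipped at the grid boundaries:

```python
def node_feature_shape(H, W, C, s):
    """Shape of feature maps at scale s in the GSN: spatial size
    H/2^s x W/2^s with C * 2^s channels (C is the base channel size)."""
    return H // 2**s, W // 2**s, C * 2**s

def input_scales(s, S):
    """Scales a processing node at scale s receives from in the previous
    layer: the three adjacent scales s-1, s, s+1, clipped to [0, S]."""
    return [z for z in (s - 1, s, s + 1) if 0 <= z <= S]

# A 512 x 512 input with base channel size 8: scale 3 nodes see 64 x 64 x 64.
assert node_feature_shape(512, 512, 8, 3) == (64, 64, 64)
# Boundary scales receive from two scales; interior scales from three.
assert input_scales(0, 6) == [0, 1]
assert input_scales(3, 6) == [2, 3, 4]
```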
The proposed GSN has a generalized architecture in that not all processing nodes and edges are activated, and typically only a few nodes in the input layer and the output layer are necessary. Fig. 2 shows that many high-performing DNN architectures for image segmentation can be considered special cases of the GSN.

Architectural parameters of the GSN
GSN can be considered a parent model containing all possible candidate operations and paths. From the GSN, our goal is to find the optimum child network. To achieve this, we use differentiable architecture search (DARTS) [24], which is explained next.
We incorporate three types of architectural parameters into the GSN that control the relative importance of i) the input image at each scale, ii) the operations at each node, and iii) the paths to nodes in the next layer. First, the GSN has one input node for each image scale. The input node (0, s) at scale s downsamples the original input image by a factor of 2^s before sending it to the next layer. Each input node (0, s) is associated with an architectural parameter γ_s that determines the importance of the input at scale s. Hence, the input to node (1, s) is defined as

I_{1,s} = γ_s f_s(X), (1)

where X is the input image, and f_s is the downsampling operation by a factor of 2^s. The parameters γ = {γ_0, γ_1, ..., γ_S} are normalized with the softmax function to represent the importance probabilities.
Fig. 2 Many hand-engineered state-of-the-art deep networks for image segmentation are special cases of the proposed GSN (best viewed in color). Green node: filtering operation. Purple node: identity operation
Second, each node now computes a mixed operation, which is a weighted sum of multiple single operations. Let I_{l,s} be the input of node (l, s). Here, I_{l,s} is the sum of all the feature maps received from the connected nodes in the previous layer. Let O = {O_1, O_2, ...} be the candidate operations. At node (l, s), the intermediate feature map is computed as

Ō_{l,s}(I_{l,s}) = Σ_{j=1}^{|O|} α_{l,s}^j O_j(I_{l,s}). (2)

Here, the architectural parameters α_{l,s} = {α_{l,s}^1, ..., α_{l,s}^{|O|}} denote the importance of each operation. The parameters α_{l,s} are also normalized with the softmax function to represent the importance probabilities. Third, each node at Scales 1 to (S − 1) will produce three feature maps of various scales. At node (l, s), the output feature maps are defined as

y_{l,s} = β_{l,s} Ō_{l,s}(I_{l,s}),
y_{l,s}^+ = β_{l,s}^+ f^+(Ō_{l,s}(I_{l,s})),
y_{l,s}^- = β_{l,s}^- f^-(Ō_{l,s}(I_{l,s})), (3)

where f^+ and f^- are the functions that upsample and downsample the feature map's size by 2, respectively. Note that the nodes at Scale 0 do not produce y_{l,s}^+ and the nodes at Scale S do not produce y_{l,s}^-. Here, the architectural parameters β_{l,s} = {β_{l,s}, β_{l,s}^+, β_{l,s}^-} denote the importance of each path. The parameters β_{l,s} are also normalized with the softmax function to represent the importance probabilities.
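The mixed operation can be illustrated with a toy sketch (ours, not the authors' implementation): candidate operations are callables, and the node output is their softmax-weighted sum applied to the same input:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw parameters."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alpha):
    """DARTS-style mixed operation: a softmax(alpha)-weighted sum of all
    candidate operations applied to the same input feature vector x."""
    w = softmax(alpha)
    return [sum(wi * op(xj) for wi, op in zip(w, ops)) for xj in x]

# Toy candidate set: identity and a "filter" that doubles activations.
ops = [lambda v: v, lambda v: 2.0 * v]
x = [1.0, -2.0, 3.0]

# Equal raw alphas -> each op gets weight 0.5, so the output is 1.5 * x.
y = mixed_op(x, ops, alpha=[0.0, 0.0])
assert all(abs(yi - 1.5 * xi) < 1e-9 for yi, xi in zip(y, x))
```

During the search, the alphas are trained by gradient descent; at derivation time, only the operation with the largest alpha survives.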
The architectural parameters γ, α, and β can be optimized using gradient descent since they lie in a continuous space. However, the memory overhead for computing all the mixed operations is large because each node now consists of |O| candidate operations. To overcome this problem, we compute the mixed operation using only 1/k of the input channels of the input feature map I_{l,s}, thereby reducing the memory overhead by a factor of k. This method is known as partially-connected DARTS [42]. Inspired by partially-connected DARTS, we modify Eq. (2) to

Ō_{l,s}(I_{l,s}) = Σ_{j=1}^{|O|} α_{l,s}^j O_j(B_{l,s} ⊙ I_{l,s}) + (1 − B_{l,s}) ⊙ I_{l,s}, (4)

where B_{l,s} is the sampled channel mask.
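A minimal sketch of the partial-channel idea, in the spirit of partially-connected DARTS [42] (the helper names, the per-channel data layout, and the exact bypass rule are our own simplifications): a random subset of channels goes through the mixed operation, while the rest bypass it unchanged:

```python
import random

def partial_mixed_op(channels, ops, weights, frac=0.25, rng=None):
    """Apply a mixed operation (weights already normalized) to a random
    fraction `frac` of the channels; the remaining channels pass through
    unchanged. The sampled index set plays the role of the mask B_{l,s}."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    n = len(channels)
    k = max(1, int(n * frac))
    sampled = set(rng.sample(range(n), k))
    out = []
    for idx, ch in enumerate(channels):
        if idx in sampled:
            out.append(sum(w * op(ch) for w, op in zip(weights, ops)))
        else:
            out.append(ch)  # bypass: cost scales with the sampled fraction
    return out, sampled

ops = [lambda v: v, lambda v: -v]
x = [1.0, 2.0, 3.0, 4.0]
out, sampled = partial_mixed_op(x, ops, weights=[1.0, 0.0], frac=0.25)
assert len(sampled) == 1   # only one of four channels is processed
assert out == x            # an identity-weighted mix leaves values intact
```

The memory saving comes from evaluating the |O| candidate operations on only 1/k of the channels instead of all of them.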

Optimizing the GSN
The final network is determined by the architectural parameters γ, α, and β. During the training phase of the GSN, the architectural parameters γ, α, β and the network weights w are optimized alternately using gradient descent. The optimization procedure is described as follows.
The training set is split into two equal subsets, A and B. For each training epoch, the network weights w are first updated with the training loss L_A, which is computed on the training subset A. Then, the architectural parameters γ, α, β are updated with the training loss L_B, which is computed on the training subset B. The architectural parameters and network weights are optimized alternately and repeatedly until convergence. Note that the network weights w are pre-trained for n epochs before the optimization of the architectural parameters γ, α, β begins, to avoid local optima.
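The alternating scheme can be sketched with a toy bilevel problem (an illustration, not the paper's training code: the two quadratic losses merely stand in for L_A and L_B, and the scalars w and alpha stand in for the weight and architecture parameters):

```python
def alternating_search(steps=200, warmup=50, lr=0.1):
    """Warm up the weight w for `warmup` steps, then alternate gradient
    steps: w on loss_A (subset A), alpha on loss_B (subset B)."""
    w, alpha = 5.0, 5.0
    # loss_A(w, a) = (w - a)^2       -> weight update pulls w toward alpha
    # loss_B(w, a) = (a - 1)^2 + 0.1 * (w - a)^2 -> alpha is pulled toward 1
    for step in range(steps):
        w -= lr * 2.0 * (w - alpha)            # d loss_A / d w
        if step >= warmup:                     # alphas frozen during warmup
            alpha -= lr * (2.0 * (alpha - 1.0) - 0.2 * (w - alpha))
    return w, alpha

w, alpha = alternating_search()
# After convergence, alpha reaches its optimum and w tracks it.
assert abs(alpha - 1.0) < 1e-2
assert abs(w - alpha) < 1e-2
```

The warmup phase mirrors the paper's choice of pre-training w for n epochs before touching the architectural parameters, which keeps the early (noisy) weight gradients from steering the architecture into a poor local optimum.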

Deriving the final DNN from the optimized GSN
MSD-NAS can derive K unique networks (K ≤ S), where each network processes the input image at a different scale. After the architectural parameters are optimized, we derive the networks as follows. We sort γ = {γ_0, γ_1, ..., γ_S} in descending order. The input node with the largest γ determines the first selected node in Layer 1. At the first selected node, the output path with the largest β and the operation with the largest α are selected. We repeat this process for every active node until Layer L. For prediction, only the output node connected to the last active node is used. We repeat the above steps using the input node with the next largest γ until K unique networks are obtained; see Figs. 3a-c for some illustrations.
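The greedy derivation can be sketched as follows (the data layout and helper names are ours, and the per-node operation choice via α is omitted for brevity): input branches are picked in decreasing order of γ, and at every layer the path with the largest β is followed:

```python
def derive_networks(gamma, beta, K):
    """Derive K branch paths from optimized architectural parameters.
    gamma[s]   : importance of the input at scale s.
    beta[l][s] : dict mapping each reachable next scale to its path weight
                 for a node at layer l, scale s."""
    order = sorted(range(len(gamma)), key=lambda s: -gamma[s])
    networks = []
    for s0 in order[:K]:
        path, s = [s0], s0
        for l in range(len(beta)):
            s = max(beta[l][s], key=beta[l][s].get)  # argmax over path weights
            path.append(s)
        networks.append(path)
    return networks

# Two layers, three scales; beta[l][s] holds the weights of moving to scale z.
beta = [
    {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.5, 2: 0.3}, 2: {1: 0.6, 2: 0.4}},
    {0: {0: 0.4, 1: 0.6}, 1: {0: 0.1, 1: 0.8, 2: 0.1}, 2: {1: 0.9, 2: 0.1}},
]
nets = derive_networks(gamma=[0.5, 0.3, 0.2], beta=beta, K=2)
assert nets == [[0, 0, 1], [1, 1, 1]]  # branches ordered by gamma
```

Each returned path lists the scale visited at every layer; overlapping prefixes of two paths correspond to the shared nodes that are merged in the final multi-branch network.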
After the K unique networks are derived, we combine them into one final deep neural network with multiple input branches, with each branch handling the input image at a different scale. The networks share nodes if they have common segments; see Fig. 3d for an illustration. If the combined network has more than one output node, we only use the output node that predicts at the smallest scale (closest to the original image resolution). Note that we train the final network from scratch using the full training set.

Short-term visual memory mechanism
Several studies have shown that the skip-connection scheme has many benefits [14,25,33]. For segmentation neural networks, skip-connections can improve performance by transferring the high-resolution feature maps from the shallow to the deep layers [25]. For very large networks, skip-connections can reduce the effects of vanishing gradients [14,33].
There are three main challenges in using the skip-connection scheme with MSD-NAS. First, the skip-connection scheme is not efficient for MSD-NAS. The network derived by MSD-NAS can consist of multiple input branches; these input branches are processed sequentially, i.e., not in parallel. Hence, we need to store the intermediate feature maps if the skip-connections are between different input branches. Moreover, the processing time is delayed if a node relies on the feature map from another input branch.
Second, in segmentation networks, skip-connections are primarily used to transfer high-resolution feature maps to aid the upsampling operations. Therefore, other types of nodes cannot share information intra-branch and inter-branch. Third, in the skip-connection scheme, feature maps are unweighted, i.e., all pixels are treated as equally important. It is beneficial to let the network learn the importance of each pixel to the overall segmentation performance.
To overcome these problems, we propose the Short-term Visual Memory (STVM) mechanism for the nodes. The phrase short-term arises from the fact that the memory only contains information about the current input image. Note that we only apply the STVM mechanism on the derived network, i.e., after completing the search phase. Next, we describe the STVM mechanism in detail.
In the derived network, there are at most S STVM modules. The nodes at scale s share the same STVM module m_s. Each STVM module m_s has the same dimensions as the output feature map of a node at scale s. Note that a feature map is a 3-D matrix, where each channel is the 2-D output of a convolution filter. Hence, the STVM module m_s can be represented as [m_s^1, m_s^2, ...]. Each node interacts with its STVM module through input and update gates. The input gate determines what information to use from the STVM module, and the update gate inserts new information into the STVM module.
Here, we describe the input gate in detail. At node (l, s), we perform a channel-wise 2-D convolution on the STVM module m_s to obtain the matrix a_{l,s} = [a_{l,s}^1, a_{l,s}^2, ...]. Each element a_{l,s}^j is computed as

a_{l,s}^j = I_{l,s}^j * m_s^j, (5)

where * is the convolution operator, and I_{l,s}^j is a 2-D learnable weight. The input gate of node (l, s) is then defined as

i_{l,s} = σ(a_{l,s}), (6)

where σ denotes the sigmoid activation function. Now, the input gate i_{l,s} represents the weight of every pixel in the STVM module m_s. The input to node (l, s) is then given as

Î_{l,s} = I_{l,s} + i_{l,s} ⊙ m_s, (7)

where ⊙ denotes element-wise multiplication. Next, we describe the update gate in detail. Let F_{l,s} = [F_{l,s}^1, F_{l,s}^2, ...] be the output of node (l, s). We perform a channel-wise 2-D convolution on the output:

b_{l,s}^j = U_{l,s}^j * F_{l,s}^j, (8)

where U_{l,s}^j is a 2-D learnable weight. The update gate of node (l, s) is then defined as

u_{l,s} = σ(b_{l,s}). (9)

Now, the update gate u_{l,s} represents the weight of every pixel in the output F_{l,s}. We update the STVM module as follows:

m_s ← u_{l,s} ⊙ F_{l,s} + (1 − u_{l,s}) ⊙ m_s. (10)

The interaction between a node and its STVM module is illustrated in Fig. 4.
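A scalar-per-pixel sketch of the two gates (our own simplification: the channel-wise convolutions are replaced by precomputed gate logits, and the gated memory blend is one plausible reading of the update rule, not necessarily the paper's exact formula):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stvm_read(node_input, memory, gate_logits):
    """Input gate: add the pixel-wise sigmoid-weighted memory to the node
    input. gate_logits stands in for the channel-wise convolution output."""
    i = [sigmoid(a) for a in gate_logits]
    return [x + ig * m for x, ig, m in zip(node_input, i, memory)]

def stvm_write(node_output, memory, gate_logits):
    """Update gate: blend the node output into the memory, pixel-wise,
    keeping (1 - u) of the old memory content."""
    u = [sigmoid(b) for b in gate_logits]
    return [ug * f + (1.0 - ug) * m for ug, f, m in zip(u, node_output, memory)]

memory, x = [0.0, 1.0], [1.0, 1.0]
# Gates saturated open (logits >> 0): the full memory is added to the input.
assert stvm_read(x, memory, [100.0, 100.0]) == [1.0, 2.0]
# Gates saturated closed (logits << 0): the memory is left (almost) unchanged.
written = stvm_write([5.0, 5.0], memory, [-100.0, -100.0])
assert all(abs(a - b) < 1e-9 for a, b in zip(written, memory))
```

The key property illustrated here is the per-pixel weighting: unlike a plain skip-connection, each location can be fully used, ignored, or partially blended.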

Experiments and analysis
In this section, we present the experiments and analysis of MSD-NAS. Section 4.1 describes the pedestrian lane dataset. Section 4.2 presents the experimental steps, and Section 4.3 describes the search configurations. Section 4.4 analyzes the proposed MSD-NAS method, and Section 4.5 compares MSD-NAS with other hand-crafted deep learning models on the pedestrian lane segmentation task.

Pedestrian lane dataset
In this paper, we conducted the experiments using the Pedestrian Lane Detection and Vanishing Point Estimation Version 3.0 (PLVP3) dataset [30]. The PLVP3 dataset comprises 10,000 color images with their corresponding ground-truth annotations. The ground-truth masks were manually annotated, where every pixel is labeled as the pedestrian-lane (1) or background (0) class. The images in PLVP3 were acquired from real indoor and outdoor scenes in various weather conditions and at different times of the day. The pedestrian paths in these images are diverse in shapes, colors, and textures. The cameras used to acquire these images are also different, resulting in images with varying widths and heights (ranging from 1224 to 1632 pixels). The overall statistics of this dataset are given in Table 1. Several images and their ground-truth masks from the PLVP3 dataset are shown in Fig. 5. The PLVP3 dataset can be downloaded from http://documents.uow.edu.au/~phung/plvp3.html.

Experimental steps
The pedestrian lane detection methods were evaluated using accuracy, mean intersection over union, and frames-per-second metrics. Accuracy is the percentage of image pixels that are correctly classified. Mean intersection over union (mIoU) is the average IoU score over the classes. IoU is defined as the area of the intersection divided by the area of the union between the predicted output and the ground-truth mask: IoU = Area of intersection / Area of union. Frames per second (FPS) is the number of predictions that a given method can produce in a second. The FPS was measured using a system with a 2.4 GHz Intel Xeon Gold 5115 CPU and a 12 GB NVIDIA GeForce GTX Titan Xp GPU.
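These metrics can be sketched on flattened masks with a few pure-Python helpers (a minimal illustration with our own function names):

```python
def accuracy(pred, gt):
    """Fraction of pixels that are correctly classified."""
    return sum(1 for p, g in zip(pred, gt) if p == g) / len(pred)

def iou(pred, gt, cls):
    """IoU for one class: |intersection| / |union| of the binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    union = sum(1 for p, g in zip(pred, gt) if p == cls or g == cls)
    return inter / union if union else 1.0

def mean_iou(pred, gt, num_classes=2):
    """mIoU: the per-class IoU averaged over all classes."""
    return sum(iou(pred, gt, c) for c in range(num_classes)) / num_classes

# Flattened masks for the two-class problem (0: background, 1: lane).
pred = [0, 0, 1, 1, 1, 0]
gt   = [0, 0, 1, 1, 0, 0]
assert accuracy(pred, gt) == 5 / 6
assert iou(pred, gt, 1) == 2 / 3                       # intersection 2, union 3
assert abs(mean_iou(pred, gt) - (3/4 + 2/3) / 2) < 1e-9
```

Note that mIoU penalizes false positives on the minority class more visibly than pixel accuracy, which is why both are reported.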
The experiments were conducted using 5-fold cross-validation. The PLVP3 dataset was divided into five partitions of equal sizes. For each fold, one partition was used as the test set, and the remaining partitions were used as the training set.
Fig. 4 The interaction between a node (l, s) and its STVM module m_s

Search settings
The search was conducted using a GSN with 14 layers (L = 14) and 6 scales (S = 6). We ran the search algorithm for 30 epochs. For each node, the candidate operations O consisted of: 1. Identity. The downsampling function f^- was implemented as a conv3 operation with a stride of 2. The upsampling function f^+ was implemented as a bilinear upsampling operation with a scale of 2, followed by a conv3 operation. The downsampling function f_s was implemented as a conv3 operation with a stride of 2^s. Every convolution operation was followed by a batch normalization operation and a ReLU activation function.
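The output-size arithmetic behind these implementations can be checked with the standard convolution formula (a small sketch with our own helper, not the authors' code):

```python
def conv_out_size(n, k=3, stride=1, pad=1):
    """Output length of a convolution along one axis:
    floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

# f^- : conv3 with stride 2 (and padding 1) halves the spatial size.
assert conv_out_size(64, k=3, stride=2, pad=1) == 32
# f_s : conv3 with stride 2^s downsamples by 2^s in a single operation.
assert conv_out_size(64, k=3, stride=8, pad=1) == 8
# f^+ : bilinear upsampling x2 followed by a stride-1 conv3 keeps size 2n.
assert conv_out_size(2 * 32, k=3, stride=1, pad=1) == 64
```

This confirms that the chosen strides line up with the GSN's factor-of-2 relationship between adjacent scales.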
The search phase of the proposed NAS method was conducted using two optimizers. The Adam optimizer was used to update the architectural parameters γ, α, β with the following settings: learning rate of 0.0003, weight decay of 0.0005, and exponential decay rates β_1 of 0.9 and β_2 of 0.999. The SGD optimizer with momentum was used to update the network weights w with the following settings: learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005.
The optimization of the architectural parameters γ, α, β only began after the network weights w were trained for 20 epochs (n = 20). With these configurations, the search running on a 12 GB NVIDIA GTX Titan Xp GPU took roughly 20 hours using a GSN with a base channel size of 8 (C = 8), and roughly 37 hours using a GSN with a base channel size of 16 (C = 16).

Ablation study
In this section, we performed an ablation study to:
- determine the optimum number of search epochs,
- find the best number of layers and scales for the GSN,
- analyze the effects of using different numbers of input branches K,
- determine the effectiveness of the STVM mechanism, and
- compare the different values of base channel size C.

Optimum number of search epochs. We ran the search five times, each time with a different number of search epochs: i) 10 epochs; ii) 20 epochs; iii) 30 epochs; iv) 40 epochs; and v) 50 epochs. The higher the number of epochs, the longer the search takes. The derived network from each search was trained from scratch, and then the mIoU score was computed for the test set. Fig. 6 shows the results of this comparison. The network found by searching for 50 epochs achieved the lowest mIoU; we believe this is due to overfitting. The network found by searching for 30 epochs achieved the highest mIoU, and its search time was moderate compared with the other configurations. Hence, we chose to search for 30 epochs in this paper.
Optimum number of layers and scales for the GSN. We performed a 3 × 3 grid search with the following choices: 12, 14, and 16 layers; and 4, 5, and 6 scales. Fig. 7 presents the results of the grid search. For the number of scales, the mIoU increased as the number of scales increased, except for configurations with 16 layers. For the number of layers, the mIoU increased as the number of layers increased; however, it stopped improving after 14 layers, except on configurations with 4 scales. For the smallest GSN (12 layers and 4 scales), the derived network obtained the lowest mIoU. Among all the tested combinations, the network derived from the GSN with 14 layers and 6 scales achieved the highest mIoU. Therefore, we adopted 14 layers and 6 scales for the GSN in this paper.
Effects of using different numbers of input branches K. We tested three configurations on the same derived network: i) one input branch (K = 1); ii) three input branches (K = 3); and iii) six input branches (K = 6). Table 2 shows the results of this experiment. The network's mIoU, inference time, and number of trainable parameters all increased with the number of input branches. The network with six input branches obtained the highest mIoU (0.9534), whereas the network with one input branch had the lowest mIoU (0.9423). We adopted the configuration with six input branches (K = 6) because it achieved the best performance while still supporting real-time operation.
Effectiveness of the STVM mechanism. We tested two settings on the same derived network: i) with STVM; and ii) without STVM. Table 3 presents the results of this study. The derived network with STVM (mIoU of 0.9534) outperformed the network without STVM (mIoU of 0.9462). This improvement of 0.0072 in mIoU from including STVM is significant because, as seen in Table 5, the top and bottom of the 13 evaluated methods (≤ 27M parameters) differ by only 0.0004 in mIoU. However, the network with STVM had more parameters than the network without STVM (10.998M versus 10.922M). This slight increase in parameters is expected because the STVM mechanism requires additional computations in every node, i.e., the input and update gates.
Different values of base channel size C. Table 4 presents the results of this comparison. The network derived with C = 16 (mIoU of 0.9545) had the highest mIoU; it also had the most trainable parameters (29.961M). The network derived with C = 4 (mIoU of 0.9503) had the lowest mIoU and the fewest trainable parameters (1.880M). Hence, selecting the value of C is a trade-off between mIoU and model size: a smaller value of C reduces both the model size and the mIoU, and vice versa. In this paper, we used both the C = 8 and C = 16 configurations as they yield networks of different sizes.

Comparison with the existing pedestrian lane detection methods
In this section, we compared MSD-NAS with 22 existing pedestrian lane detection methods using 5-fold cross-validation: 2 traditional methods, 1 NAS method, and 19 hand-designed models. The evaluated models were grouped into two sizes: i) methods with ≤ 27M trainable parameters, and ii) methods with > 27M trainable parameters.
For a robust evaluation, we ran the proposed search algorithm on every fold, resulting in five unique deep networks designed by MSD-NAS. We then trained these networks from scratch and tested their performance on their respective test sets. The performance of MSD-NAS is defined as the test results averaged over these five networks. To obtain two network sizes that fit the two groups, we set C = 8 for the smaller network and C = 16 for the larger network.
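The per-fold protocol above can be sketched as follows. `search_architecture`, `train_from_scratch`, and `evaluate` are hypothetical stand-ins, mocked so the sketch runs end to end; in practice they would run MSD-NAS, full training, and test-set measurement.

```python
# Minimal sketch of the 5-fold evaluation protocol: search on each fold's
# training split, train the derived network from scratch, test on the
# held-out fold, and average. All three functions are mocked placeholders.

def search_architecture(train_folds):
    return {"arch": "derived", "folds": train_folds}

def train_from_scratch(arch, train_folds):
    return arch                      # placeholder for full training

def evaluate(model, test_fold):
    return 0.95                      # placeholder test mIoU

fold_scores = []
for test_fold in range(5):
    train_folds = [f for f in range(5) if f != test_fold]
    arch = search_architecture(train_folds)      # one unique network per fold
    model = train_from_scratch(arch, train_folds)
    fold_scores.append(evaluate(model, test_fold))

mean_miou = sum(fold_scores) / len(fold_scores)  # reported performance
print(round(mean_miou, 4))  # 0.95
```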
Several conclusions can be drawn from the results. First, MSD-NAS can automatically design DNNs that outperform the hand-designed network architectures in terms of accuracy, mIoU, speed, and model size. MSD-NAS (58.82 FPS for C = 8 and 52.63 FPS for C = 16) can also support real-time pedestrian lane detection, as most video cameras capture at 30 to 50 FPS.
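Converting the reported throughputs to per-frame latency shows why both configurations meet a 30 to 50 FPS camera budget:

```python
# Worked arithmetic: per-frame latency from the reported throughputs.
def latency_ms(fps):
    return 1000.0 / fps

for label, fps in [("C = 8", 58.82), ("C = 16", 52.63)]:
    print(f"{label}: {latency_ms(fps):.2f} ms per frame")

# C = 8: 17.00 ms per frame; C = 16: 19.00 ms per frame -- both under the
# 20-33 ms per-frame budget implied by 30-50 FPS cameras.
```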
Third, the choice of encoder affects the segmentation performance of the same architectural design. For example, PSPNet (ResNet101) outperformed PSPNet (ResNet34), and DeepLabV3+ (ResNet101) outperformed DeepLabV3+ (Xception). Hence, for every new problem, different encoders should also be considered for the same architectural design.

Conclusion
This paper introduces a new neural architecture search algorithm, called MSD-NAS, for pedestrian lane detection. MSD-NAS can automatically design an optimized DNN with multi-scale input branches, allowing the derived network to utilize global and local contextual information for predictions. The search space of MSD-NAS is the GSN, a large and dense space that also contains many existing hand-designed deep models as candidates; hence, the search space is sufficiently large and generic. Additionally, we propose a novel Short-term Visual Memory mechanism to improve information facilitation among the nodes in the derived DNN. Our experiments demonstrate that MSD-NAS can automatically find compact and fast DNNs that outperform 22 existing methods in some or all performance metrics (accuracy, mIoU, processing speed, and model size). Notably, the DNN found by MSD-NAS is 20.16 times faster and 2.56 times smaller than the existing best deep learning model, Multiscale HRNet-OCR, while having a statistically insignificant difference in accuracy and mIoU. This paper demonstrates a concrete step toward a practical system for the assistive navigation of blind people.
Although MSD-NAS achieves high performance on the pedestrian lane detection task, several research directions can be explored in future work. The first is optimizing the search time of MSD-NAS. Currently, the search process is split into two phases: optimizing the GSN and optimizing the derived network. The optimized GSN weights are discarded, and the derived network is trained from scratch. Future work can explore re-using the GSN weights or combining the two phases. The second research direction is incorporating more search criteria into the loss function. At present, MSD-NAS focuses only on obtaining the best accuracy; it may find more efficient DNNs by directly considering latency or model size in the loss function.

Fig. 3
Fig. 3 An illustration of the network derivation procedure when K = 3. F_{l,s} is used to obtain the matrix b_{l,s} = [b^1_{l,s}, b^2_{l,s}, ...].

Fig. 7
Fig. 7 Results of the grid search for finding the optimum number of GSN layers and scales. The value inside each box is the test mIoU

Fig. 8
Fig. 8 An illustration of the STVM module at scales 1, 3, and 5. For simplicity, only the channel with the highest activation is visualized.

Fig. 9
Fig. 9 The MSD-NAS network (C = 16) derived from fold-1 of the dataset. The input and output nodes are omitted for conciseness. White circle: node. Blue text: operation. Solid black line: data flow between nodes

Table 1
Statistics of the PLVP3 dataset

Table 2
The effects of using different numbers of input branches

Table 3
The effects of using the STVM mechanism in the derived network

Table 4
The comparison between the different values of base channel size C

Table 5
The mean performance of different lane segmentation methods over five folds. The best metrics in each group of methods are given in bold. *We reject the null hypothesis H0: m_MSD-NAS = m_other at a confidence level of