BACNN: Multi-scale feature fusion-based bilinear attention convolutional neural network for wood NIR classification

Effective development and utilization of wood resources is critical. Wood modification has become an integral part of wood science research; however, the similarity between modified and original wood makes accurate identification and classification difficult with conventional image classification techniques. Efficient and accurate wood classification techniques are therefore needed. This paper presents a one-dimensional convolutional neural network (BACNN) that combines near-infrared spectroscopy and deep learning to classify poplar, tung, and balsa woods, as well as PVA, nano-silica-sol and PVA-nano silica sol modified poplar woods. The results show that BACNN achieves an accuracy of 99.3% on the test set. Among traditional machine learning methods, this is higher than the 52.9% of the BP neural network and the 98.7% of the support vector machine; among deep learning methods, it is higher than the 97.6% of LeNet, the 98.7% of AlexNet and the 99.1% of VGGNet-11. The proposed classification method therefore has potential applications in wood classification, especially for homogeneous modified wood, and provides a basis for subsequent studies of wood properties.


Introduction
As a biopolymer material, wood has the advantage of being renewable and is widely used in construction, furniture, and aerospace applications (Huang et al. 2020). With the decline in forest resources and the shortage of quality wood, attention is turning to fast-growing plantation wood and wood modification. The functional modification of wood by physical and chemical methods, and thereby the creation of new types of wood characterized by high added value and versatility, is important for economic and environmental development (Macior et al. 2022). However, the physical and chemical properties of wood vary considerably among species, and it is particularly challenging to distinguish modified wood from traditional wood, so a more effective classification technique is necessary.
NIR spectroscopy has been widely used in forestry research because it can provide information on the internal functional groups of wood, and thus confirm the class of wood and its physical and chemical properties without relying extensively on surface characteristics such as colour and grain. For example, Wang et al. (2015a) used a cluster analysis model, a Bayesian discriminant model and a support vector machine model to classify the NIR spectra of ten woods with accuracies of 83.3%, 86.7% and 85.0%, respectively. In a follow-up study, Wang et al. (2015b) classified the NIR spectra of a total of 296 wood samples from five tree species using a BP neural network model. The classification accuracy reached 100% for species of different genera and more than 85.0% for species of the same genus. However, traditional classification methods depend on preprocessing of the spectra and require manual extraction of features before classification, which is subject to significant interference from human error (Nisgoski et al. 2017).
Convolutional neural networks, as a data-driven modeling approach, have been widely used in NIR spectral feature extraction due to their powerful extraction capabilities. Building on earlier architectures such as LeNet (Lecun et al. 1998), and especially since the emergence of AlexNet (Krizhevsky et al. 2012), various high-precision deep convolutional neural network models have appeared, such as VGGNet (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), ResNet (He et al. 2016), and DenseNet (Huang et al. 2017). Convolutional neural networks combined with NIR spectroscopy reflect the internal characteristics of the sample and avoid the drawbacks of redundant information in NIR spectroscopy. They are applied in two main ways. First, features are extracted directly from high-dimensional raw spectra (Lecun et al. 2015), using the extraction capability of convolutional neural network models on one-dimensional spectra to obtain their intrinsic features, which are then exploited with traditional chemometric approaches. Second, the high-dimensional original spectrum is reduced in dimension and reconstructed using preprocessing methods such as PCA, and features are extracted from the reconstructed spectrum by convolutional neural networks, which transform low-dimensional data into high-dimensional abstract features through multi-level nonlinear modules. Through layer-by-layer feature extraction, the model can eventually learn complex feature representations. Consequently, convolutional neural networks have an increasingly important role in spectral analysis (Chen and Wang 2018).
For example, Jia et al. (2020) built an 8-layer convolutional neural network model to analyze and predict the near-infrared spectra of water quality with a prediction accuracy of over 99.0%. Tang and Chen (2021) used convolutional neural networks to analyse the near-infrared spectra of soils for pH prediction with an accuracy of 90.0%. Xia et al. (2021) classified plastics using one-dimensional convolutional neural networks with 100% accuracy. Yang et al. (2020) used convolutional neural networks to classify softwood with over 99.0% accuracy. Pan et al. (2022) classified the original NIR spectra of softwood and successfully distinguished 21 wood samples.
This study combines NIR spectra with convolutional neural networks and presents a bilinear attention convolutional neural network model (BACNN) based on multi-scale feature fusion to classify the NIR spectra of poplar wood (PW), tung wood (TW), balsa wood (BW), PVA modified poplar wood (PVAW), nano-silica-sol modified poplar wood (SW) and PVA-nano silica sol modified poplar wood (PSW). BACNN mitigates the effect of noise with a convolutional block of 1 × 7 kernels (Zhang et al. 2017), after which features of the spectrum are extracted at different scales by two branches. SE modules (Hu et al. 2018b) are added to the two branches to obtain better quality features, and fully connected layers are finally used for classification. The main innovations of this paper are: (1) Modification experiments were conducted on poplar wood to obtain the experimental samples, the near-infrared spectra of the samples were collected, and the data were augmented to obtain the data set.

Specimen preparation
In this experiment, PW, TW, BW, PVAW, SW and PSW were the experimental samples. The materials used were free of cracks, knots, discoloration, or other defects to minimize the disturbance of environmental variables. The logs were placed at an ambient temperature of 20 °C and a relative humidity of 65% to reach a state of moisture absorption equilibrium before making the modified woods. Anhydrous ethanol (> 99%), deionized water, sodium sulphite (> 96%), sodium hydroxide (> 96%), polyvinyl alcohol (PVA) and silica nanosol solutions used in the experiments were purchased from the University laboratory. All raw materials were used without further treatment.
To prepare the modified poplar wood, a 3% aqueous PVA solution was first prepared and the original wood was immersed in it for 24 h at room temperature and pressure. After impregnation, the wood samples were removed and washed with deionized water to obtain PVAW. Similarly, wood samples were placed into the nano-silica-sol solution and treated to obtain SW. PVA and nano-silica-sol were mixed at 3:1 (mass ratio), sonicated for 30 min and stirred for eight hours at room temperature with a magnetic stirrer to obtain the impregnation solution. The original poplar wood was impregnated and treated with this solution to obtain PSW. The experimental samples are shown in Fig. 1.

NIR spectra measurements
A Nicolet iS10 Fourier transform infrared spectrometer (FT-IR), controlled by the OMNIC software, was used to collect wood spectra at a resolution > 0.4 cm⁻¹, scanning the surface of the sample over 400-4000 cm⁻¹. The number of scans per point was set to 32 and the absorption spectrum was recorded. The background spectrum was first measured and saved; the wood sample was then placed on the spectrometer detector and its absorption spectrum measured. To increase the richness of the spectral samples and ensure that the spectra fully reflect their characteristics, spectra were collected from different parts of each wood, with five pieces of wood selected for each type of sample. Thirty points were collected for each piece of wood, for a total of 906 near-infrared spectra.

Enhancement and pre-processing of NIR spectra
Deep learning networks extract features from large volumes of raw data, and good performance can only be achieved with a relatively large amount of data (Gao et al. 2021a). Therefore, augmentation techniques are necessary to enrich the data, improve classification accuracy and prevent overfitting. In this study, data augmentation was performed by adding small random Gaussian noise with a mean of 0 and a variance of 0.02 to the spectra, doubling the data to 1812 spectra. The augmented spectrum shows essentially the same trend as the initial spectrum, with several important peaks and troughs remaining unchanged. The added noise does not change the chemical information in the spectra; it only enlarges the fluctuations, which makes peaks and valleys harder to identify.
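The augmentation step can be illustrated with a minimal sketch. It assumes that one noisy copy is generated per measured spectrum (which is what doubles the data set) and that the variance of 0.02 applies directly to the stored absorbance values; the function name and the toy spectra are hypothetical.

```python
import random

def augment_with_gaussian_noise(spectra, variance=0.02, seed=0):
    """Return the original spectra plus one noisy copy of each (doubles the set)."""
    rng = random.Random(seed)
    sigma = variance ** 0.5  # random.gauss expects a standard deviation
    noisy = [[x + rng.gauss(0.0, sigma) for x in spectrum] for spectrum in spectra]
    return spectra + noisy

# Toy stand-in for measured absorbance spectra (hypothetical values).
spectra = [[0.10, 0.35, 0.80, 0.42], [0.12, 0.30, 0.75, 0.40]]
augmented = augment_with_gaussian_noise(spectra)
print(len(spectra), "->", len(augmented))
```

Applied to the 906 measured spectra, the same doubling gives the 1812 spectra used for training and testing.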
The raw spectrum contained unimportant information such as noise and also had overlapping peaks (Kauppinen 1983), so it was necessary to attenuate the noise and separate the overlapping peaks while reducing the dimensionality of the spectrum for reconstruction to reduce the amount of data. This study used S-G convolutional smoothing (Soares et al. 2016) combined with principal component analysis (PCA) (Wang et al. 2017) to preprocess the original spectrum. S-G convolutional smoothing fits polynomials to the data by the least squares method, which retains useful information in the analyzed signal and eliminates random noise. The PCA method is capable of reducing the dimensionality of the spectra and also of enhancing overlapping peaks, which can be used to solve the problem of spectral overlapping peaks (Kuesel et al. 1996). In the experiments, the best results were obtained when the S-G window length was set to eleven with a second-order polynomial fit (Fig. 2). Eight hundred principal components were then extracted from 6950 features using PCA to improve the spectral signal-to-noise ratio, enhance the overlapping peaks and solve the spectral overlap problem (Pachuta 2004). The spectra were reconstructed by dimensionality reduction to further reduce the data volume (Qin et al. 2013). Finally, the data were randomly divided into training and test sets at a ratio of 7:3 (Table 1).
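The S-G smoothing step with a window length of eleven and a second-order fit can be sketched as follows. This is an illustration rather than the authors' code: it uses the closed-form smoothing weights for a quadratic least-squares fit, and it pads the ends of the spectrum by repeating the end values, which is a simplifying assumption (library routines such as `scipy.signal.savgol_filter` handle the edges differently). The PCA step is not shown.

```python
def savgol_quadratic_coeffs(window_length):
    """Closed-form Savitzky-Golay smoothing weights for a quadratic fit."""
    if window_length % 2 == 0 or window_length < 3:
        raise ValueError("window_length must be odd and >= 3")
    m = window_length // 2
    denom = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / denom
            for i in range(-m, m + 1)]

def savgol_smooth(signal, window_length=11):
    """Smooth a 1-D signal; edges are padded by repeating the end values."""
    coeffs = savgol_quadratic_coeffs(window_length)
    m = window_length // 2
    padded = [signal[0]] * m + list(signal) + [signal[-1]] * m
    return [sum(c * padded[t + j] for j, c in enumerate(coeffs))
            for t in range(len(signal))]

# Hypothetical spectrum: a smooth trend plus alternating "noise".
raw = [0.01 * t * t + (0.05 if t % 2 else -0.05) for t in range(40)]
smoothed = savgol_smooth(raw)
print(round(raw[20], 3), round(smoothed[20], 3))
```

Because the fit is quadratic, a purely quadratic trend passes through the filter unchanged away from the edges, while the rapid alternating component is attenuated.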

Convolutional neural network
The purpose of convolution is to perform feature extraction. The filter is trained to suppress distracting information such as noise and to extract the main features for classification. When a convolutional neural network is used for feature extraction, the operation is given by Eq. 1,
where $x_i^k$ is the output of channel $i$ of the $k$th convolutional layer; $f$ is the activation function; $len$ is the length of the convolutional kernel; $x^{k-1}$ is the output of the previous convolutional layer; $b^l$ is the bias; and $W_i^l$ is the weight matrix. During training, the network model continuously adjusts the weight matrix and bias until the loss function is reduced to the desired value, at which point the convolution layer gives more weight to the main features of the input data and less weight to the noise, achieving good classification. When applying convolutional neural networks to spectral analysis, parameter sharing helps reduce the number of parameters and prevents overfitting (Jiao et al. 2019). The max-pooling operation compresses the input feature map and extracts the main features of the spectrum (Graham et al. 2014). The local perception ability of convolution facilitates the extraction of peak and trough features in the spectrum, which makes convolutional neural networks suitable for one-dimensional spectral analysis. Acquarelli et al. (2017) noted that the convolutional neural network uses the learned convolutional kernels for smoothing and derivative filtering of the input information to solve the problem of noise in the spectrum as well as overlapping peaks, so the dependence of convolutional neural networks on spectral pre-processing operations is greatly reduced. In this study, the classification accuracy of the model decreased when first-order and second-order derivative spectra were used for training. This led to the conclusion that S-G convolutional smoothing combined with PCA enables the convolutional neural network model to adequately eliminate noise and cope with interference from problems such as overlapping peaks, without additional operations to further pre-process the spectrum.
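The convolution and max-pooling operations described above can be sketched for a single input channel. The form assumed here, an activation applied to the weighted sum over a window of length `len` plus a bias, is the standard one-dimensional convolution and is consistent with the symbol definitions given for Eq. 1; the ReLU activation, stride of 1, absence of padding and the hand-picked kernel are assumptions made for illustration.

```python
def relu(v):
    return v if v > 0.0 else 0.0

def conv1d_channel(x, weights, bias):
    """One output channel: slide the kernel along x (stride 1, no padding)."""
    k = len(weights)
    return [relu(sum(weights[j] * x[t + j] for j in range(k)) + bias)
            for t in range(len(x) - k + 1)]

def max_pool1d(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[t:t + size]) for t in range(0, len(x) - size + 1, size)]

# Hypothetical spectrum fragment with two peaks.
spectrum = [0.0, 0.1, 0.9, 0.2, 0.0, 0.1, 0.8, 0.1]
smoothing_kernel = [1 / 3, 1 / 3, 1 / 3]  # behaves like a moving average
features = conv1d_channel(spectrum, smoothing_kernel, bias=0.0)
pooled = max_pool1d(features, size=2)
print(len(spectrum), len(features), len(pooled))
```

In a trained network the kernel weights are learned rather than fixed; a kernel such as `[-1, 0, 1]` would instead act as a derivative filter, which is the behaviour Acquarelli et al. (2017) describe.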

Network structure
Common methods to improve the classification accuracy of neural networks include increasing the depth of the network model and adding an attention mechanism. Because the amount of input information in this study is large, a deeper network would be needed and would require substantial computational resources, so a bilinear branching network with multi-scale feature fusion was used instead of increasing the depth of the network model. An attention mechanism was added to obtain more detailed features for classification. The BACNN used in this study is an 11-layer bilinear convolutional neural network model in which two branches with different convolutional kernel sizes extract multi-scale spectral features. The SE module was added to reduce the interference of other information. When the convolutional kernel is relatively small, e.g., 3, the accuracy of the network is higher and weak and overlapping peaks can be identified accurately, but the running speed is slower. When the kernel is relatively large, it can filter noise and increase the training speed, but weak and overlapping peaks may be misjudged or missed. Therefore, two branches at different scales were used to obtain different features, which were then fused. This not only reduces the preprocessing of the samples but also further filters noise, yields more accurate features, and reduces the training time.
The structure of BACNN is shown in Fig. 3. The input spectral signal first passes through two convolutional layers with a 1 × 7 convolutional kernel to attenuate the interference of noise. Spectral features are then extracted at different scales by two neural network branches, CsA and CsB (Wang et al. 2019). Higher quality features are obtained by adding the SE module, the two sets of features are fused using a bilinear pooling operation, and finally the one-dimensional vector from the fusion is fed to the fully connected layer for classification. Table S1 shows the detailed parameters of BACNN.

Multi-scale feature fusion
BACNN extracts features from different angles of the spectrum using two convolutional neural network branches with different convolutional kernels. A 1 × 3 kernel is assigned to CsA to extract coarse features while reducing the parameters of the network, and a 1 × 5 kernel is assigned to CsB. Given that a larger convolutional kernel increases the receptive field, CsB is able to extract accurate features. The BACNN is represented by Eq. 2:
$$B = (F, CsA, CsB, Fc_{31}, Fc_{32}) \qquad (2)$$
where $F$ stands for the 1 × 7 convolutional block, CsA and CsB represent the two linear branches, and $Fc_{31}$ and $Fc_{32}$ refer to the fully connected layers.
The CsA branch and the CsB branch can be expressed by Eqs. 3 and 4, respectively.
As fusion requires the feature vectors output from both branches to have the same dimensionality, this study adjusts the feature vectors of the SE module to 1 × 512 using an adaptive maximum pooling layer. The fully connected layer is replaced with a pooling layer to reduce the amount of data while ensuring that both have the same dimensionality. The two are then concatenated along the vertical axis to obtain a 1024 × 1 feature vector, as shown by Eq. 5, where $x_1$ and $x_2$ represent the outputs of the two adaptive maximum pooling layers and $f$ is the fused feature vector, which contains all the features of the two scales. It represents the features more comprehensively and is then passed to the fully connected layer and the softmax (Asadi and Littman 2017) layer for classification.
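The fusion step can be sketched as adaptive max pooling of each branch output to 512 values followed by concatenation into a 1024-element vector. For simplicity the sketch treats each branch output as a flat vector; the binning rule is the one commonly used for adaptive pooling (for example in PyTorch's `AdaptiveMaxPool1d`), and the branch lengths and values are hypothetical.

```python
def adaptive_max_pool1d(x, output_size):
    """Split x into output_size contiguous bins and keep the max of each bin."""
    n = len(x)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = -(-((i + 1) * n) // output_size)  # ceiling division
        pooled.append(max(x[start:end]))
    return pooled

def fuse(branch_a, branch_b, size=512):
    """Pool both branch outputs to the same length, then concatenate them."""
    x1 = adaptive_max_pool1d(branch_a, size)
    x2 = adaptive_max_pool1d(branch_b, size)
    return x1 + x2  # fused vector of length 2 * size

# Hypothetical branch outputs of different lengths.
branch_a = [float(i % 7) for i in range(1536)]
branch_b = [float(i % 5) for i in range(1100)]
fused = fuse(branch_a, branch_b)
print(len(fused))
```

Because the pooling is adaptive, the two branches may produce feature vectors of different lengths and still be fused without an extra fully connected layer.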

SE module
The SE module, as a channel attention network, focuses on modelling the interdependence between channels, assigning different weights to the feature vectors of different channels, and then summing them proportionally to obtain more accurate features. Fig. S1 shows the structure of the SE module. The SE module contains three parts: squeeze, excitation, and scale. The squeeze operation compresses the feature map into a 1 × 1 × C vector by performing global average pooling on the input. Next, the excitation operation consists of a fully connected layer with a stack of activation functions. Finally, the essence of the scale operation is multiplication by the channel weights. The SE module calculates the weight of each channel and multiplies it with that channel, thus assigning different proportions to different channels to get better results.
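The three SE operations can be sketched on a small one-dimensional feature map. The excitation here follows the original SE design of Hu et al. (2018b), a bottleneck fully connected layer with ReLU followed by a second fully connected layer with a sigmoid, which is an assumption about the configuration used in BACNN; the weights and the feature map are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(vec, weights, biases):
    """Fully connected layer: weights[o][i] maps input i to output o."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

def se_block(feature_map, w1, b1, w2, b2):
    """feature_map: list of C channels, each a list of L values."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation: bottleneck FC + ReLU, then FC + sigmoid -> weights in (0, 1).
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    weights = [sigmoid(s) for s in dense(hidden, w2, b2)]
    # Scale: multiply every value of a channel by that channel's weight.
    scaled = [[wc * v for v in ch] for wc, ch in zip(weights, feature_map)]
    return scaled, weights

# Hypothetical feature map with C = 4 channels and a bottleneck of 2 units.
fmap = [[0.2, 0.4, 0.6], [1.0, 1.2, 0.8], [0.0, 0.1, 0.0], [0.5, 0.5, 0.5]]
w1 = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, 0.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.3], [0.0, 0.0]]
b2 = [0.0, 0.0, 0.0, 0.0]
scaled, channel_weights = se_block(fmap, w1, b1, w2, b2)
print([round(w, 3) for w in channel_weights])
```

The output keeps the shape of the input; only the relative magnitude of the channels changes.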

1 × 7 block
The size of the convolutional kernel is important for convolutional neural network models. Networks with small convolutional kernels have the advantages of low computational cost and fast convergence, but their receptive field is small and they are easily disturbed by noise. Although convolutional neural networks with large kernels are more computationally intensive, they have a larger receptive field and suppress noise. Since the input data used in this study were near-infrared spectra of wood, interference from environmental noise was inevitable when acquiring the spectra; in addition, random Gaussian noise was added for data augmentation. Because the data therefore contain considerable noise, this study used two convolutional layers with a 1 × 7 convolutional kernel to form a block that suppresses the effect of data noise.
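The effect of stacking two 1 × 7 layers on the receptive field can be worked out with a short calculation. The sketch assumes stride 1 and no dilation, which are not stated in the text; the function name is hypothetical.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked 1-D convolutions (no dilation)."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # a stride multiplies the step between outputs
    return rf

print(receptive_field([7, 7]))  # 13
print(receptive_field([3, 3]))  # 5
```

Under these assumptions, each output of the 1 × 7 block summarizes 13 neighbouring spectral points, compared with 5 for two stacked 1 × 3 layers, which is why the larger kernels average out more point-to-point noise.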

Pipeline
The pipeline of wood classification is shown in Fig. S2. First, the original wood sample was prepared and impregnated to obtain the modified sample. Second, the spectrum of the sample was collected using a near-infrared spectrometer, and the spectral data were augmented and pre-processed to obtain the final data set. Finally, the spectral data were fed into the BACNN network and the classification results were output by the softmax layer.
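The final step of the pipeline, converting the network outputs into a class decision with softmax, can be sketched as follows. The six class labels come from the text; the logit values are hypothetical.

```python
import math

CLASSES = ["PW", "TW", "BW", "PVAW", "SW", "PSW"]

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the most probable class label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]

logits = [0.2, -1.0, 0.5, 3.1, 0.0, 1.4]  # hypothetical network outputs
label, confidence = predict(logits)
print(label, round(confidence, 3))
```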

Experimental environment and hyper-parameter
The experiments were run on a server with Windows 10, an Intel(R) Xeon(R) Bronze 3204 processor, 128 GB of memory, and an NVIDIA GeForce RTX 3090 GPU (Table 2). When training the model, the ratio of training set to test set was 7:3. The training set contained 1268 spectra and the test set 544. The hyperparameters include the number of epochs, the batch size and the learning rate (L_r), which were set to 200, 800 and 1e-3, respectively.
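The split sizes and the number of batches per epoch follow directly from these settings. The sketch below assumes a simple shuffled, unstratified split with a fixed seed; the text states only that the division was random, so the helper and the seed are hypothetical.

```python
import math
import random

def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

HYPERPARAMS = {"epochs": 200, "batch_size": 800, "learning_rate": 1e-3}

train_idx, test_idx = split_indices(1812)
batches_per_epoch = math.ceil(len(train_idx) / HYPERPARAMS["batch_size"])
print(len(train_idx), len(test_idx), batches_per_epoch)  # 1268 544 2
```

With a batch size of 800, each epoch therefore consists of two parameter updates, one on a full batch of 800 spectra and one on the remaining 468.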

Experimental results
The loss function curves of the training and test sets during training are shown in Fig. 4. On the training set, the loss function reached a minimum of 0.01 after 100 iterations. This indicates that the model converges quickly and fully learns the features. On the test set, the accuracy reached 99.6% at 61 iterations, while the loss function fluctuated around 0.025, indicating that the model has good generalization ability and can make accurate predictions for the unseen wood samples in the test set.

Comparison test
To evaluate the superiority of the BACNN, it was compared with BP (Rumelhart et al. 1986), SVM (Ma et al. 2020), AlexNet, LeNet, and VGGNet-11. BP and SVM are traditional machine learning methods, for which P, R, F1 and Accuracy were compared comprehensively for each category; AlexNet, LeNet, and VGGNet-11 are deep learning-based methods that use convolutional neural networks to extract features for classification. This study compared the loss function and accuracy curves of these models and their confusion matrices, i.e., it compared the combined P, R, F1 and Accuracy metrics of these models, and drew conclusions.

Evaluation indicators
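Precision (P), recall (R), F1-score (F1) and Accuracy were used as evaluation metrics, defined as follows:
$$P = \frac{TP}{TP + FP} \qquad (6)$$
$$R = \frac{TP}{TP + FN} \qquad (7)$$
$$F1 = \frac{2 \times P \times R}{P + R} \qquad (8)$$
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$
where TP means that the prediction is true and the actual is true, FP that the prediction is true and the actual is false, FN that the prediction is false and the actual is true, and TN that the prediction is false and the actual is false.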
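As an illustration of how these metrics are obtained for one class of a multi-class problem, the following sketch derives TP, FP, FN and TN from a confusion matrix in a one-vs-rest fashion and then computes P, R, F1 and Accuracy using the standard definitions. The confusion matrix is a hypothetical three-class example, not data from this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from the four confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"P": precision, "R": recall, "F1": f1, "Accuracy": accuracy}

def per_class_counts(confusion, cls):
    """confusion[i][j] = number of samples with true class i predicted as j."""
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[r][cls] for r in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return tp, fp, fn, tn

confusion = [[48, 2, 0], [1, 49, 0], [0, 0, 50]]  # hypothetical 3-class example
metrics = binary_metrics(*per_class_counts(confusion, 0))
print({name: round(value, 4) for name, value in metrics.items()})
```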

Comparison with traditional algorithms
Table 3 compares BACNN with BP and SVM on four metrics: P, R, F1, and Accuracy. Bold represents the model proposed in this paper and its associated performance metrics. Since no manual feature extraction was performed, the results obtained with the BP neural network were not satisfactory, with an accuracy of only 52.9%. Compared with the BP neural network, SVM showed a substantial improvement in all metrics, with an accuracy of 98.7%. Since NIR spectra can reflect the information of functional groups inside the wood, SVM can accurately identify the species of modified poplar wood. However, there are errors in classifying different species. When BACNN classified the six types of wood material, all metrics performed well and the accuracy was as high as 99.6%, better than both the BP neural network and SVM.

Comparison with deep learning methods
LeNet, AlexNet and VGGNet-11 were compared with BACNN to verify the superiority of BACNN. LeNet-5 has five layers, AlexNet has eight, and VGGNet-11 has eleven. BACNN has eight layers in a single branch and eleven layers in total. The loss function curves for the four models are shown in Fig. 5. Table 4 shows the accuracy on the prediction set, the time required for training and the size of the resulting network model.
As seen in Fig. 5, on the training set BACNN converged first and its training process was the most stable. LeNet converged the slowest, stabilizing only after 170 iterations; AlexNet and VGGNet-11 converged quickly but were unstable. On the test set, BACNN had the highest accuracy, VGGNet-11 the second, AlexNet the third and LeNet the lowest (Table 4). The training times and model sizes of the four models are also shown in Table 4, with model size and training time increasing in the order LeNet, AlexNet, BACNN and VGGNet-11.
From LeNet to AlexNet to VGGNet-11, training time and model size increased and the prediction accuracy increased gradually. The training time and model size of BACNN were smaller than those of VGGNet-11, yet its prediction accuracy was the highest, which shows the advantages of the BACNN.
In the confusion matrix in Table 5, BACNN identified only four samples incorrectly, LeNet eleven samples, AlexNet seven samples, and VGGNet-11 five samples, indicating that BACNN had the smallest prediction error on the test set.
From these comparison tests (Fig. S3), it can be seen that the classification accuracy of BACNN was significantly higher than that of the traditional machine learning methods, the BP neural network and SVM, because BACNN uses a convolutional neural network model as a feature extractor to perform feature extraction automatically. Compared with BACNN, LeNet and AlexNet require less training time and produce smaller models, but their training process is unstable, their convergence is slower and their classification accuracy is lower. In contrast, the VGGNet-11 model is larger than BACNN and takes longer to train, but its convergence speed is lower and its classification accuracy is slightly lower. From the above analysis, it can be concluded that BACNN has superior classification results compared to common classification models.

Ablation experiments
To verify the rationality of the structure of the BACNN, ablation experiments were set up. The rationality of the two-branch network was verified first: the BACNN was tested against the CsA and CsB branches to prove the superiority of the two-branch network over the single-branch one. Following this, the rationality of adding the SE module was verified; comparing the cases with and without the SE module demonstrated that adding the SE module helped improve the performance of BACNN. It was also shown that adding the 1 × 7 convolution block suppressed the noise and enhanced the generalization ability of the BACNN.

Bilinear branching
Bilinear branching extracts spectral features at different scales and then fuses them to obtain more comprehensive and higher quality features. In this section, the effectiveness of the bilinear network was analyzed in three cases: 1: bilinear model (BACNN); 2: upper branch network (CsA); and 3: lower branch network (CsB). The variation of the loss function for these three cases is shown in Fig. 6. It is noticeable from Fig. 6 that CsA, which filters features with smaller convolutional kernels, converged the fastest during training, while CsB, which extracts more fine-grained features with larger kernels, was the slowest. Although the BACNN model is more complex and involves multi-scale feature fusion, it still converged faster than CsB, with an accuracy of 99.6% on the test set, compared to 98.9% for both CsA and CsB. BACNN not only had the highest accuracy and the best generalization capability, but was also more stable. The results show that a network structure using two branches to extract features at different scales is superior to single-branch networks.

SE model
Because SE modules are 'plug-and-play', they can be easily and effectively used in various networks. In this study, the effects of the presence or absence of SE modules, and of their location and number, on the performance of the BACNN were analysed. As the SE modules require the number of channels to be > 16, the SE modules were placed in the two branches and four cases were considered: condition 1: no SE module; condition 2: the SE module placed after the first convolution; condition 3: the SE module placed after the second convolution; condition 4: SE modules placed after both convolutions. The accuracy on the test set in these four cases is shown in Table 6. Bold represents the position of the SE module when accuracy is highest.
As shown in Table 6, the prediction accuracy of the model was lower when there were no SE modules. The accuracy improved after adding one SE module to each branch, indicating that adding SE modules in this way improves the accuracy of the model. However, the accuracy in condition 4 was the same as in condition 1, indicating that adding too many SE modules is detrimental to the model's performance. The comparison between conditions 2 and 3 shows that the prediction accuracy of the model was better when the SE module was added after the second convolutional layer. This indicates that, under the experimental conditions of this study, the SE module should be located at a position with a high number of feature channels to be more effective.

1 × 7 block
Because the NIR spectral data contained considerable noise, a block with 1 × 7 convolutional kernels was used to enhance the noise immunity of the model. To test the effectiveness of the 1 × 7 block, the loss functions on the training set with and without the 1 × 7 block were compared (Fig. 7).
As seen in Fig. 7, the loss function fluctuated considerably at the onset of training when the 1 × 7 block was removed. Since the network model without the 1 × 7 block is simpler, its initial convergence was faster than that of BACNN, but after 30 training iterations its convergence became significantly slower, and it finally converged more slowly than BACNN. With the addition of the 1 × 7 block, the generalisation of the BACNN model improved significantly, with an accuracy on the test set of 99.6%, higher than the 95.8% achieved when the 1 × 7 block was removed. In summary, adding the 1 × 7 block suppresses the noise in the spectral data and improves the stability of the training process. Although the complexity of the network increases, convergence is faster due to less noise interference, which improves the generalization ability of the model and further improves its classification capability.
From the ablation experiment, it may be concluded that setting up a bilinear network for multi-scale feature extraction is reasonable and provides a better performance than a single-branch network.By adding the SE module, it allows

Conclusions
This paper presents a bilinear attention convolutional neural network (BACNN) model based on multi-scale feature fusion to classify the near-infrared spectra of six classes of modified wood with 99.6% accuracy. Comparison tests with BP, SVM, LeNet, AlexNet and VGGNet-11 showed that the BACNN achieved the best results against both traditional machine learning methods and deep learning methods, verifying its superiority. Ablation experiments showed that the accuracy of the model could be improved by using a two-branch network to extract features at different scales and by adding SE modules and 1 × 7 blocks to the network model, thus proving the rationality of the BACNN network model structure. Based on the above, the BACNN proposed in this paper can achieve automatic feature extraction and optimal classification results when classifying wood NIR spectra. In addition, the BACNN is also expected to contribute to wood performance prediction.

Fig. 1 Experimental samples

Fig. 3 Network Structure

Table 1 Number of training and test sets

Table 3 Model prediction analysis

Table 4 Prediction set accuracy

Table 6 SE

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.