Introduction

Petrographic image analysis involves delineating mineral grains in thin-section petrographic images to measure grain size and shape and, in turn, determine the rock’s composition and structure. Petrographic image segmentation is the primary and crucial step of petrographic image analysis and can be reduced to a binary classification task in which pixels are separated into edge and background categories. However, this seemingly simple binary classification task remains an intriguing problem, especially given the blurred and overlapping edges of three-dimensional grains observed in a two-dimensional image. It becomes even more challenging because the large-scale color and intensity variations of the grains depend on several factors (explained in detail below).

Automated alternatives, such as scanning electron microscopy (SEM), are constrained when minerals are chemically equivalent but differ in their optical reflectance properties. This is why, in the past, petrographers had to segment hundreds of grains manually by tracing their contours. Moreover, this time-consuming and tedious task requires the petrographer to combine all the features of plane- and cross-polarized light at various polarization angles, a subjective process that puts a high premium on the petrographer’s expertise and knowledge.

Most anisotropic minerals lack color when seen under a plane-polarized light microscope, but the hue and intensity of grains vary continually when viewed through a cross-polarized light microscope at different angles of polarized light [1]. Owing to the continuously varying angle between the orientation of the crystal axes and the polarized light, the brightness of a grain peaks at a particular polarization angle and fades completely to dark as the polarizer gradually shifts to another angle, a phenomenon referred to as extinction. Other factors affecting the color of a grain include the thickness of the thin section, the optical properties of different grains, and the crystal structure of the minerals. As a result, segmentation is prone to both under- and over-segmentation. On the one hand, the edges of neighboring grains may become blurry, and the texture of certain rock types may be mistaken for edges. On the other hand, over-segmentation frequently occurs when impurities within grains are confused with smaller grains.

Due to the intricacy of grain edge detection, previous approaches fail to generate accurate and reliable segmentation results. This paper presents an automatic two-class edge segmentation model for plane- and cross-polarized petrographic images. The remainder of the paper is organized as follows. “Literature review” reviews previous work on grain segmentation and grain edge detection, and “Theoretical background and methodology” gives a detailed description of the extinction consistency perception network proposed in this paper. The relevant ablation and comparison experiments are presented in “Experiments and results”. Finally, “Conclusion and future work” concludes the paper.

In this study, we propose a CNN-based computer vision methodology for generating reliable edge segmentation maps, with a specially designed block for perceiving extinction phenomena across several thin-section petrographic images under cross-polarized light. A general summary of our primary contributions is as follows:

  • The extinction consistency perception network (ECPN) is proposed and trained from scratch. The main part of this model, namely the multi-scale edge perception network (MEPN), is composed of the modified EfficientNetV2 and the proposed BiDecoder. It has been proven to be an effective framework for improving feature extraction and aggregation.

  • The proposed multi-angle extinction consistency (MAEC) block can be seen as a preprocessing stage to capture the extinction of grains and augment pixels within the same grains into edge-enhanced features. It consists of two parts: the extinction consistency enhancement (ECE) block and the squeeze and excitation (SE) block, which function in the spatial and channel dimensions respectively.

  • The distance map penalized compound loss function is constructed to direct the network’s attention toward the grains’ boundaries. In comparison with the widely used cross-entropy loss function, it penalizes dissimilarity not only in terms of statistical distributions but also in terms of mismatches across overlap zones.

  • A dataset, named the cross-polarized petrographic image datasets (CPPID), has been shared with the community with precise ground truth of mineral grain edges. Experiments on CPPID demonstrate that the proposed ECPN model outperforms several classical edge detectors by a large margin, achieving 0.940 ODS and 0.941 OIS.

Literature review

We first review several models for detecting grain edges in petrographic images, which may be divided into four types: edge-based methods [1,2,3], energy-based methods [4], region-based methods [5,6,7,8], and machine learning methods [9,10,11].

Edge-based approaches utilize changes in intensity or luminance to determine grain boundaries, but they depend heavily on hand-crafted feature filters and hence cannot ensure boundary closure. Zhou et al. [2] first passed phi- and max-images through an enhanced Canny detector. A region-growing method was then applied to segment the generated edge maps. In the subsequent stage, the segmented phi- and max-images were merged into a single image that outperformed either image alone. Goodchild et al. [1] extracted edges from a gradient image using a fundamental gradient operator and estimated closed edges using a number of image processing techniques. Heilbronner et al. [3] proposed the lazy grain boundary (LGB) method for creating grain boundary maps from petrographic thin sections: it extracts boundaries by gradient filtering on sets of regularly polarized micrographs and combines the most significant grain boundaries from each image in a given input set into a single grain boundary map.

The goal of energy-based approaches is to minimize an energy function, but these approaches are computationally costly and may struggle to converge to the optimal solution. Jungmann et al. [4] sought to minimize the value of an energy function to achieve segmentation, suggesting an extension of the MDL-based region-merging approach that merges edge features across neighboring areas.

Region-based methods cluster pixels with comparable attributes and are capable of generating tight borders; however, they are imprecise. Feng et al. [6, 7] proposed a two-stage technique that first generates superpixels using a modified simple linear iterative clustering (SLIC) algorithm and then combines them using a designed region-merging scheme. Utilizing a series of cross-polarized images, Lumbreras et al. [8] over-segmented images and combined the over-segmented blocks based on each grain’s preferred direction. Zhou [5] first defined a set of edge operators with varying mask sizes and then employed a colored edge detection algorithm with a large neighborhood to extract colored edges while reducing noise. The image was then segmented using a seed-region-growing algorithm based on color edge information and the distance between edge and non-edge pixels. Finally, an elimination mechanism was developed to merge regions with shared boundaries.

Fig. 1 ECPN overall architecture is made up of two parts: MAEC and MEPN. Note that numbers 0–6 in the modified EfficientNetV2 of the MEPN block denote stages 0–6 of the original EfficientNetV2 network structure

Segmentation maps are typically limited by the quality of features, whereas learning-based approaches detect edges by applying data-supervised algorithms to manually designed features. Fueten et al. [9] post-processed the output of a standard segmentation procedure: an artificial neural network was used to classify pixels with varying patterns and color attributes as edge or non-edge. Izadi et al. [10] generated max images from plane-light and cross-polarized-light images; twelve color features extracted from the max images were then clustered incrementally to segment the minerals. Rajdeep et al. [11] applied a psychophysics model to obtain a binary segmented output and utilized a k-means clustering algorithm with a selected threshold to generate the final segmentation map.

Recent advances in CNN backbone architectures, such as VGGNet [12], GoogLeNet [13], and ResNet [14], have resulted in remarkable gains for computer vision tasks. Numerous works, including MobileNet [15, 16], Xception [17], DenseNet [18], and EfficientNet [19], have made major strides in attaining higher accuracy through effective model architectures with lower complexity, rather than larger and more complicated networks.

Turning to the CNN approaches most relevant to our work, HED [20] is the first CNN-based edge detection network to provide state-of-the-art performance as a deep-supervision extension of FCN [21]. Following it, other efforts [22,23,24,25,26,27,28,29] continued to set records using boosted encoders and altered backbones. RCF [22] enhanced HED’s skip-layer structure and tested it using image pyramids; BDCN [23] proposed supervising each layer independently at a given scale and used dilated convolution to generate multi-scale features; instead of employing a pre-trained model, DexiNed [24] modified the backbone and trained it from scratch, achieving competitive and satisfactory results. PiDiNet [29] presents a straightforward, light, and effective edge detection architecture built on its proposed pixel difference convolutions, which integrate conventional edge detection filters into normal convolutional operations.

CNN-based methods [30,31,32,33] for semantic segmentation of petrographic images are also noteworthy. Note that semantic segmentation partitions images into the background and several mineral types, whereas edge detection separates images into only two categories: edge or background.

Rubo et al. [30] utilized discrete convolutional filters, neural networks, and random forest classifiers to generate semantic segmentation maps of petrographic images; the results were evaluated by 10-fold cross-validation and chemical microscopy. Tang et al. [31] employed a three-layer neural network that received plane- and cross-polarized light images, generating segmentation maps composed of several different rocks. Saxena et al. [32] studied the potential of using a convolutional neural network to predict pores in rock images (two classes) as well as a 10-class semantic segmentation map. Das et al. [33] created the Deep Semantic Grain Segmentation network (DSGSN) to semantically segment XPL and PPL images of sandstone into two classes (i.e., grain and background).

In summary, traditional methods usually apply image processing techniques or machine learning to extract useful features and then predict segmentation maps, and are therefore limited by hand-designed features; their predictions are not satisfactory. Data-supervised CNN algorithms, by contrast, benefit from an end-to-end learning paradigm that predicts more refined segmentation results. However, CNN-based methods are limited by large data requirements, and there is no publicly available dataset of petrographic images. In addition, both previous image processing methods and CNN-based approaches ignore the intrinsic properties of mineral grains and their specific optical behavior in petrographic images, such as extinction phenomena, which are widely used by geologists for mineral naming and identification. Furthermore, the majority of earlier models take only a few rock images as input (plane- or cross-polarized, 1–5 images) without considering the intrinsic connection among the input images.

Theoretical background and methodology

This section discusses the extinction consistency perception network (ECPN), the proposed method for edge detection in thin-section petrographic images, which takes a succession of petrographic images and predicts the grain edge map. The ECPN model can be subdivided into two sub-networks (as seen in Fig. 1): (1) the multi-angle extinction consistency (MAEC) block, which fuses the input of seven image patches into a three-channel feature map based on the continuous extinction phenomena; and (2) the multi-scale edge perception network (MEPN), which provides enhanced feature extraction and strong feature aggregation.

Multi-angle extinction consistency block

The multi-angle extinction consistency (MAEC) block is used to augment pixels within the same particles into an enhanced feature by capturing the extinction phenomena of grains in petrographic images taken at multiple angles of cross-polarized light.

This block is inspired by the tradition of petrologists manually rotating polarizers to observe multiple sequential images in order to detect and partition grains. Furthermore, a large number of prior studies [2, 5,6,7,8, 10, 31] used multiple images to determine grain boundaries. To provide the model with additional information about the grains’ edges, we construct the input of this block as petrographic images at seven different polarization angles, as illustrated in Fig. 5.

Given a stack of input images \(X\in \mathbb {R}^{m\times n\times 21}\) comprising seven image patches, X is separated into three portions \(X_\textrm{R}\in \mathbb {R}^{m\times n\times 7}\), \(X_\textrm{G}\in \mathbb {R}^{m\times n\times 7}\), and \(X_\textrm{B}\in \mathbb {R}^{m\times n\times 7}\) based on the Red, Green, Blue color space. The separated features are then transformed by the proposed extinction consistency enhancement (ECE) block and the squeeze-and-excitation (SE) block [34], which aim to improve feature representation in the spatial and channel dimensions, respectively:

$$\begin{aligned} X_{i}^{\prime } = \textrm{SE}(\textrm{ECE}(X_{i})). \end{aligned}$$
(1)

The ECE block outputs an enhanced feature and is explained in “Extinction consistency enhancement block”. The SE block has proved to be an excellent channel-attention operation, enabling the model to automatically learn the importance of channel features. It consists of two operations: squeeze and excitation. The squeeze operation compresses the feature map by global average pooling; the excitation operation generates a weight for each channel using a two-layer bottleneck structure.

The enhanced feature map is used as the input of a subsequent one-layer convolution, which condenses the features into a single channel representing that color space. Finally, the R, G, and B color-space features are concatenated to create edge-enhanced grain features:

$$\begin{aligned} X^{\prime \prime } = \textrm{concat}(W_\textrm{R}X^{\prime }_\textrm{R},W_\textrm{G}X^{\prime }_\textrm{G},W_\textrm{B}X^{\prime }_\textrm{B}), \end{aligned}$$
(2)

where \(W_\textrm{R}, W_\textrm{G}\) and \(W_\textrm{B}\) denote the one-layer convolution transform matrix.
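
To make the data flow concrete, the following is a minimal PyTorch sketch of the MAEC block as described by Eqs. (1) and (2). It assumes the ECE block of the next subsection (any module mapping a 7-channel stack to 14 channels will do) and a standard squeeze-and-excitation implementation; the channel widths and the SE reduction ratio are illustrative assumptions, not the authors’ released code.

```python
# A minimal sketch of the MAEC block (Eqs. 1-2), not the released code.
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-excitation channel attention [34]."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze: global average pooling
        return x * w[:, :, None, None]       # excite: per-channel reweighting

class MAEC(nn.Module):
    def __init__(self, ece_factory, ece_out=14):   # ECE doubles 7 -> 14 channels (assumed)
        super().__init__()
        self.ece = nn.ModuleList(ece_factory() for _ in range(3))
        self.se = nn.ModuleList(SE(ece_out) for _ in range(3))
        self.condense = nn.ModuleList(nn.Conv2d(ece_out, 1, 1) for _ in range(3))

    def forward(self, x):                    # x: (B, 21, H, W), 7 angles x RGB
        outs = []
        for i, xi in enumerate(torch.split(x, 7, dim=1)):   # R, G, B stacks
            outs.append(self.condense[i](self.se[i](self.ece[i](xi))))  # Eq. (1) + 1x1 conv
        return torch.cat(outs, dim=1)        # Eq. (2): (B, 3, H, W)
```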

Extinction consistency enhancement block

Extinction is a term used in optical mineralogy and petrology that refers to the dimming of cross-polarized light, as viewed through a thin section of a mineral under a petrographic microscope [35]. During extinction, the pixel values (R, G, B, and gray value) within a grain change continuously, whereas pixels belonging to adjacent grains change differently; we refer to this as extinction consistency in this paper. We create the extinction consistency enhancement (ECE) block to enhance features within the same grains based on the extinction phenomenon, as a beneficial cue for grain edge detection. For each pixel \(\left( i,j \right) \) in the input of the MAEC, let \(F_{({i,j})}\in \mathbb {R}^{1 \times 1 \times 7}\) denote its value vector and \(F_{({i - x,j - y})} \in \mathbb {R}^{1 \times 1 \times 7}\) the value vector of its nearby pixels, where \(\left( {x,y} \right) \in \Omega = \left\{ {\left\{ {- 1, - 1} \right\} ,\left\{ {- 1,0} \right\} ,\left\{ {- 1,1} \right\} ,\left\{ {0, - 1} \right\} } \right\} \). Figure 2a illustrates the location distribution of pixel \(\left( i,j \right) \) and pixels \(\left( {i - x,j - y} \right) \).

Fig. 2 a The location distribution of pixel \(\left( i,j \right) \) and pixel \(\left( {i - x,j - y} \right) \); b illustration of the proposed MAEC block

Since pixels within the same rock particle share the same trend of continuous extinction, the more similar the value vector \(F_{({i,j})}\) of one pixel is to those of its nearby pixels, the more likely they belong to the same particle. For this reason, we seek a function \(f\left( \cdot \right) \) that measures the similarity between \(F_{({i,j})}\) and \(F_{({i - x,j - y})}\).

Considering the amount of computation, we simply adopt the inner product between \(F_{({i,j})}\) and \(F_{({i - x,j - y})}\) as the similarity measurement \(f\left( \cdot \right) \). Moreover, to obtain a better representation of the features, the value vectors are first transformed by linear transformations, and the inner product is then taken between the transformed features to calculate their similarity:

$$\begin{aligned} f\left( {F_{({i,j})},F_{({i - x,j - y})}} \right) = \left( {W_{({i,j})}F_{({i,j})}} \right) ^{T}\left( {W_{({i - x,j - y})}F_{({i - x,j - y})}} \right) , \end{aligned}$$
(3)

where \(W_{({i,j})}\) and \(W_{({i-x,j-y})}\) are the weight matrices of the linear transformations.

The softmax function is utilized to normalize the calculated similarity scores, where \(\alpha _{({i - x,j - y})}\) intuitively measures how similar pixel \(\left( i,j \right) \) and pixel \(\left( i-x,j-y \right) \) are and can also be seen as a weight:

$$\begin{aligned} \alpha _{({i - x,j - y})} = \frac{\exp \left( {f\left( {F_{({i,j})},F_{({i - x,j - y})}} \right) } \right) }{\sum _{{({x,y})} \in \Omega }{\exp \left( {f\left( {F_{({i,j})},F_{({i - x,j - y})}} \right) } \right) }}. \end{aligned}$$
(4)

In the next step, we fuse the transformed value vectors of the nearby pixels, \(W_{({i - x,j - y})}F_{({i - x,j - y})}\), by weighting and summing them:

$$\begin{aligned} F_{({i,j})}^{\prime } = \sum _{{({x,y})} \in \Omega } \alpha _{({i - x,j - y})}W_{({i - x,j - y})}F_{({i - x,j - y})}. \end{aligned}$$
(5)

Finally, the aggregated feature \(F_{({i,j})}^{\prime }\) is concatenated with the value vector \(F_{({i,j})}\), and the ReLU activation function is employed to obtain a powerful feature representation \(F_{({i,j})}^{\prime \prime }\) of pixel \(\left( i,j \right) \):

$$\begin{aligned} F_{({i,j})}^{\prime \prime } = \textrm{ReLU}\left( \textrm{concat}\left( {F_{({i,j})},F_{({i,j})}^{\prime }} \right) \right) . \end{aligned}$$
(6)

To reduce computation, all calculations are conducted in parallel, and the feature vectors \(F_{({i-x,j-y})}\) are generated using the accelerating scheme proposed by Dai et al. [36].
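
The following is a hedged sketch of the ECE block following Eqs. (3)–(6). We read \(W_{({i,j})}\) and \(W_{({i-x,j-y})}\) as shared 1×1 convolutions, and torch.roll supplies the four neighbors of \(\Omega \) (with wrap-around at the borders, a simplification of the accelerated gathering scheme of [36]).

```python
# A hedged sketch of the ECE block (Eqs. 3-6) under our reading of the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

OMEGA = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]   # the neighbour offsets (x, y)

class ECE(nn.Module):
    def __init__(self, channels=7):
        super().__init__()
        self.wq = nn.Conv2d(channels, channels, 1, bias=False)  # transforms F_(i,j)
        self.wk = nn.Conv2d(channels, channels, 1, bias=False)  # transforms F_(i-x,j-y)

    def forward(self, f):                        # f: (B, 7, H, W)
        q = self.wq(f)
        sims, vals = [], []
        for x, y in OMEGA:
            # rolling by (x, y) places F_(i-x, j-y) at position (i, j)
            k = torch.roll(self.wk(f), shifts=(x, y), dims=(2, 3))
            sims.append((q * k).sum(dim=1, keepdim=True))   # Eq. (3): inner product
            vals.append(k)
        alpha = F.softmax(torch.cat(sims, dim=1), dim=1)    # Eq. (4): weights over Omega
        fused = sum(alpha[:, i:i + 1] * v for i, v in enumerate(vals))  # Eq. (5)
        return F.relu(torch.cat([f, fused], dim=1))         # Eq. (6): (B, 14, H, W)
```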

Multi-scale edge perception network

The multi-scale edge perception network (MEPN) can be broken down into two distinct components: (1) an encoder developed from EfficientNetV2, whose series of stages extracts grain edge information at different scales effectively; and (2) the BiDecoder, built to exploit the multi-level features and make an accurate prediction of the edge map.

Efficient encoder: We customized EfficientNetV2 as the encoder of MEPN due to its small parameter count and effective performance. The following adjustments were made: (1) applying EfficientNetV2-S, the variant with fewer parameters, to fit our small dataset; (2) utilizing feature maps from stages 1–6 (except stage 4) as inputs to the BiDecoder with ascending receptive fields; stage 4 is not used because it is the first stage to utilize the MBConv6 block and its features are not yet stable and informative; and (3) replacing the final classification section with a semantic segmentation stage.

BiDecoder: It is vital to merge features of different scales to generate an accurate edge map, yet conventional decoders are inefficient because features only migrate from small to large spatial feature maps. Recently, He et al. [23] used the loss function to ensure bi-directional feature flow, which empowers the model to predict multi-scale edge segmentation maps. Inspired by this, we enhance feature representation from the standpoint of decoder structure by utilizing both top-down and bottom-up information circulation channels and adding an extra skip link.

Fig. 3 The detailed architecture of BiDecoder. The input features come from stages 1, 2, 3, 5, and 6 with ascending receptive fields, and the output is the concatenation of hierarchical features

As illustrated in Fig. 3, the BiDecoder is connected to the encoder’s five side outputs \(h_{i}, i=1,2,3,4,5\). Each input \(h_{i}\), except the bottom one with the smallest spatial size (\(h_{1}\)), is added to the features upsampled by a transposed convolution (the yellow square). This is followed by a separable convolution block (the pink square) and a batch normalization operation.

$$\begin{aligned} h_{i}^{1} = {\left\{ \begin{array}{ll} h_{i} &{} i=1 \\ \textrm{BN}(\textrm{SepConv}(h_{i} + \textrm{Trans}(h_{i-1}))) &{} i=2,3,4,5 \end{array}\right. } \end{aligned}$$
(7)

After that, the features \(h_{i+1}^{1}\) with larger spatial size are downsampled and added to the features \(h_{i}^{1}, i=1,2,3,4\). Additionally, there is a skip connection (the orange arrow) between features \(h_{i}\) and \(h_{i}^{1}\), \(i=2,3,4\), intended to improve feature propagation. A series of operations then transforms and upsamples the generated features, which can be formulated as follows:

$$\begin{aligned} h_{i}^{\prime \prime } = {\left\{ \begin{array}{ll} \textrm{Trans}(\textrm{Conv}(\textrm{BN}(\textrm{SepConv}(h_{i}^{1} + \textrm{Conv}(h_{i+1}^{1}))))) &{} i=1 \\ \textrm{Trans}(\textrm{Conv}(\textrm{BN}(\textrm{SepConv}(h_{i}^{1} +\textrm{Conv}(h_{i+1}^{1}))) + h_{i})) &{} i=2,3,4 \\ \textrm{Trans}(\textrm{Conv}(h_{i}^{1})) &{} i=5 \end{array}\right. } \end{aligned}$$
(8)

Finally, the features from all levels are concatenated to produce the final result. It is worth mentioning that the last convolution layer compresses the channels into one, followed by a sigmoid activation, to generate the final two-class (edge and background) segmentation map.

$$\begin{aligned} \textrm{Output} = \textrm{Sigmoid}(\textrm{Conv}(\textrm{BN}(\textrm{Concat}(h_{i}^{\prime \prime })))), \quad i=1,2,3,4,5. \end{aligned}$$
(9)
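
The following is a minimal sketch of the BiDecoder in Eqs. (7)–(9), assuming five inputs \(h_{1},\ldots ,h_{5}\) whose spatial sizes double from the deepest \(h_{1}\) (smallest, per Eq. 7) to \(h_{5}\), all with a shared channel width. We read Trans(\(h_{i-1}\)) in Eq. (7) as acting on the already-fused feature, and a 1×1 convolution plus bilinear interpolation stands in for the per-level Conv/Trans pair of Eq. (8); module names are our own shorthand, not released code.

```python
# A hedged sketch of the BiDecoder (Eqs. 7-9) under the assumptions above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SepConv(nn.Module):
    """Depthwise-separable 3x3 convolution (the pink square in Fig. 3)."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return self.pw(self.dw(x))

class BiDecoder(nn.Module):
    def __init__(self, c=64, levels=5):
        super().__init__()
        self.levels = levels
        # Top-down pass, Eq. (7): transposed conv upsamples the level below by 2.
        self.trans = nn.ModuleList(nn.ConvTranspose2d(c, c, 2, stride=2) for _ in range(levels - 1))
        self.sep1 = nn.ModuleList(SepConv(c) for _ in range(levels - 1))
        self.bn1 = nn.ModuleList(nn.BatchNorm2d(c) for _ in range(levels - 1))
        # Bottom-up pass, Eq. (8): strided conv downsamples the level above by 2.
        self.down = nn.ModuleList(nn.Conv2d(c, c, 3, stride=2, padding=1) for _ in range(levels - 1))
        self.sep2 = nn.ModuleList(SepConv(c) for _ in range(levels - 1))
        self.bn2 = nn.ModuleList(nn.BatchNorm2d(c) for _ in range(levels - 1))
        self.mix = nn.ModuleList(nn.Conv2d(c, c, 1) for _ in range(levels))
        # Final fusion, Eq. (9).
        self.bn_out = nn.BatchNorm2d(c * levels)
        self.conv_out = nn.Conv2d(c * levels, 1, 1)

    def forward(self, hs):                     # hs = [h_1, ..., h_5]
        # Eq. (7): h_1 passes through unchanged; others absorb the level below.
        h1 = [hs[0]]
        for i in range(1, self.levels):
            h1.append(self.bn1[i - 1](self.sep1[i - 1](hs[i] + self.trans[i - 1](h1[i - 1]))))
        # Eq. (8): each level absorbs the downsampled level above it; the
        # orange skip link re-adds the encoder output for i = 2, 3, 4.
        h2 = [None] * self.levels
        h2[-1] = self.mix[-1](h1[-1])
        for i in range(self.levels - 2, -1, -1):
            fused = self.bn2[i](self.sep2[i](h1[i] + self.down[i](h1[i + 1])))
            if 0 < i < self.levels - 1:
                fused = fused + hs[i]
            h2[i] = self.mix[i](fused)
        # Eq. (9): upsample every level to the finest scale, concatenate,
        # and squash to a single-channel edge probability map.
        size = h2[-1].shape[-2:]
        cat = torch.cat([F.interpolate(h, size=size, mode="bilinear",
                                       align_corners=False) for h in h2], dim=1)
        return torch.sigmoid(self.conv_out(self.bn_out(cat)))
```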

Distance map penalized compound loss function

Grain edge segmentation is a challenging class-imbalance problem: edge pixels account for less than 10% of total pixels, with non-edge pixels making up the rest. We therefore first inherit the extensively used class-balanced cross-entropy loss function [20, 22,23,24,25,26,27,28,29]. Let X refer to the input image patch stack, \(Y\in \left[ 0,1\right] \) the ground truth, and \(Y^{\prime }\) the predicted grain edge map, with \(Y^{\prime } = \left[y_{1}^{\prime },~y_{2}^{\prime },\ldots ,~y_{N}^{\prime }~ \right]\), where \(y_{i}^{\prime } \in \left[0,1 \right]\) represents the probability that a pixel belongs to an edge and N denotes the number of pixels in the predicted edge map. The class-balanced cross-entropy loss function is:

$$\begin{aligned} l_{1}\left( X,W \right) = -\beta \sum \limits _{i \in Y^{+}} \log \sigma \left( y_{i}^{\prime } = 1 \mid X;W \right) - \left( 1 - \beta \right) \sum \limits _{i \in Y^{-}} \log \sigma \left( y_{i}^{\prime } = 0 \mid X;W \right) , \end{aligned}$$
(10)

where \(\beta = |Y^{-} |/ |Y |\) and \(1-\beta = |Y^{+} |/ |Y |\), with \(|Y^{+} |\) and \(|Y^{-} |\) denoting the edge and non-edge ground truth pixel sets, \(\sigma ({\cdot })\) the sigmoid activation function, and W the set of all network parameters.

The main drawback of \(l_{1}\left( X,W \right) \) is that it only penalizes differences between two statistical distributions over the entire image, ignoring discrepancies in the regions where the predicted segmentation map \(Y^{\prime }\) and the ground truth Y overlap. To add additional information near the edges, we introduce a distance map penalty loss function that directs the network to concentrate on object boundaries during training. It is defined as follows:

$$\begin{aligned} l_{2}\left( {X,W} \right) = - \sum _{i = 1}^{N} D_{i}\log \sigma \left( {y_{i}^{\prime }}\mid X;W \right) , \end{aligned}$$
(11)
$$\begin{aligned} D = \alpha \left( 1/\left( {D^{\prime } + \epsilon } \right) \right) ,\quad D_{i} \in D,~i \in \left\{ 1,2,\ldots ,~N \right\} , \end{aligned}$$
(12)

where D is the distance penalty mask and \(D_{i}\) represents the distance-map-penalized weight of pixel i; \(D^{\prime }\) is the distance map, computed from the inverted ground truth mask using the Euclidean distance mapping algorithm [37]. D and \(D^{\prime }\) are illustrated in Fig. 4. \(\epsilon \) is set to 1e−9 to prevent \(\left( {D^{\prime } + \epsilon } \right) \) from being zero, and \(\alpha ({\cdot })\) is an assignment function defined as follows:

$$\begin{aligned} \alpha \left( x \right) = {\left\{ \begin{array}{ll} x, &{} x \ne 0 \\ 1, &{} x = 0 \end{array}\right. } \end{aligned}$$
(13)
Fig. 4 Visual display of intermediate results of \(l_{2}\left( X,W\right) \). a The ground truth Y; b the inverted ground truth; c the distance map \(D^\prime \) transformed from the inverted ground truth; d the distance penalty mask D; note that the edge width in D is much wider than in the ground truth Y because of the non-zero weights near edges

The proposed distance map penalized loss function \(l_{2}\left( {X,W} \right) \) is coupled with the class-balanced cross-entropy loss function \(l_{1}\left( X,W \right) \) to reduce training instability. The distance map penalized compound loss function \(l \left( X,W \right) \) is therefore:

$$\begin{aligned} l\left( {X,W} \right) = l_{1}\left( {X,W} \right) + l_{2}\left( {X,W} \right) . \end{aligned}$$
(14)
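
Below is a sketch of the DMPC loss under our reading of Eqs. (10)–(14), for a single image: `pred` is a sigmoid probability map and `gt` a binary edge mask, both (H, W) float tensors. Following Fig. 4, the distance transform is computed on the inverted ground truth, and edge pixels receive weight 1 (our interpretation of the assignment function \(\alpha \) in Eq. 13); batching would loop over images.

```python
# A hedged, per-image sketch of the DMPC loss (Eqs. 10-14).
import torch
from scipy.ndimage import distance_transform_edt

def dmpc_loss(pred, gt, eps=1e-9):
    pred = pred.clamp(eps, 1 - eps)
    log_p = gt * torch.log(pred) + (1 - gt) * torch.log(1 - pred)
    # Eq. (10): class-balanced cross-entropy; the rare edge class gets
    # the larger weight beta = |Y-|/|Y|.
    beta = 1 - gt.mean()
    l1 = -(beta * gt * torch.log(pred)
           + (1 - beta) * (1 - gt) * torch.log(1 - pred)).sum()
    # Eqs. (11)-(13): weights decay with distance from the nearest edge;
    # the transform is taken on the inverted ground truth as in Fig. 4.
    d_prime = distance_transform_edt((1 - gt).cpu().numpy())
    d = 1.0 / (d_prime + eps)
    d[d_prime == 0] = 1.0                      # on-edge pixels: weight 1
    d = torch.as_tensor(d, dtype=pred.dtype, device=pred.device)
    l2 = -(d * log_p).sum()
    return l1 + l2                             # Eq. (14)
```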

Evaluation metrics

We utilize the same F-measure and accuracy (A) as specified in classic studies [6] to evaluate the proposed model. The evaluation metrics are as follows:

$$\begin{aligned} P = \textrm{TP}/\left( \textrm{TP} + \textrm{FP} \right) , \end{aligned}$$
(15)
$$\begin{aligned} R = \textrm{TP}/\left( \textrm{TP} + \textrm{FN} \right) , \end{aligned}$$
(16)
$$\begin{aligned} F = \frac{2PR}{P + R}, \end{aligned}$$
(17)
$$\begin{aligned} A = \frac{\textrm{TP} + \textrm{TN}}{\textrm{TP} + \textrm{TN} + \textrm{FP} + \textrm{FN}}, \end{aligned}$$
(18)

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively, and P and R denote precision and recall.

Given an edge probability map, a threshold is used to produce the predicted map: if the probability of a pixel is above this threshold, the pixel is classified as edge, and otherwise as background. There are two ways to set this threshold. One is the optimal dataset scale (ODS), which employs a fixed threshold for all test images; the other is the optimal image scale (OIS), which selects an optimal threshold for every image. In this paper, we use the F-measure at both ODS and OIS to measure segmentation performance.

The F-measure penalizes the model according to statistical distributions. The mean IoU is used as an additional quantitative performance indicator to monitor the regions where the ground truth and predicted grain edge maps overlap.

$$\begin{aligned} \textrm{IoU} = \frac{1}{N}\sum \limits _{i = 1}^{N} \frac{G_{i} \wedge P_{i}}{G_{i} \vee P_{i}}, \end{aligned}$$
(19)

where \(P_{i}\) and \(G_{i}\) are the predicted edge map and ground truth edge map pixels, respectively, and N is the total number of pixels in the segmentation edge map.
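
A sketch of this evaluation protocol is given below: the F-measure follows Eqs. (15)–(17), ODS picks one threshold for the whole test set, and OIS picks the best threshold per image. The threshold grid is an assumption, and the IoU function is the usual intersection-over-union reading of Eq. (19); any boundary-matching tolerance used in the paper may differ.

```python
# A hedged sketch of the ODS/OIS threshold sweep and IoU computation.
import numpy as np

def f_measure(pred_bin, gt):
    tp = np.logical_and(pred_bin, gt).sum()
    fp = np.logical_and(pred_bin, ~gt).sum()
    fn = np.logical_and(~pred_bin, gt).sum()
    p = tp / (tp + fp + 1e-12)                    # Eq. (15)
    r = tp / (tp + fn + 1e-12)                    # Eq. (16)
    return 2 * p * r / (p + r + 1e-12)            # Eq. (17)

def iou(pred_bin, gt):
    inter = np.logical_and(pred_bin, gt).sum()
    union = np.logical_or(pred_bin, gt).sum()
    return inter / (union + 1e-12)

def ods_ois(probs, gts, thresholds=np.linspace(0.01, 0.99, 99)):
    gts = [g.astype(bool) for g in gts]
    f = np.array([[f_measure(pr > t, gt) for t in thresholds]
                  for pr, gt in zip(probs, gts)])  # (num_images, num_thresholds)
    ods = f.mean(axis=0).max()                     # best fixed threshold overall
    ois = f.max(axis=1).mean()                     # best threshold per image
    return ods, ois
```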

Materials and implementing methods

Cross-polarized petrographic image datasets (CPPID)

The carefully annotated cross-polarized petrographic image dataset for grain edge detection is one of this paper’s key contributions. The datasets were collected using a CAIKON XP-330C polarizing microscope equipped with a Daheng Image MER-2000-19U3C camera. CPPID includes five different kinds of rock thin sections. Each thin section was captured at 25 different fields of view, with images taken at seven polarization angles (0\(^{\circ }\), 15\(^{\circ }\), 30\(^{\circ }\), 45\(^{\circ }\), 60\(^{\circ }\), 75\(^{\circ }\), and 90\(^{\circ }\)) for each field of view. There are 875 petrographic images in total, each measuring \(4666 \times 3672 \times 3\). Table 1 displays detailed information on the original dataset.

Table 1 The introduction of cross-polarized petrographic image datasets (CPPID)

Preprocessing

To increase the receptive field of the CNN and account for the GPU hardware’s memory constraints, we applied several preprocessing steps to the raw dataset, namely image resizing, image patch cropping, and train/test set division.

We first decreased the input data size so that a larger batch size could be used to accelerate training. Given the relatively high resolution of the original images, we downsampled each image to a quarter of its original size (2048 \(\times \) 1536 \(\times \) 3) without losing the image’s main information. Since the downsampled image is still of high resolution, 35 image patches (512 \(\times \) 512 \(\times \) 3) were obtained from each downsampled image by a sliding-window cropping operation; with each image patch set containing seven image patches at the various polarization angles, we obtained 4375 image patch sets. Another advantage of the sliding-window cropping operation is that object edges are low-level semantics, so unlike high-level semantic tasks (e.g., segmentation and detection), cropping an edge detection dataset does not generally introduce information loss or accuracy degradation. A sketch of the cropping operation is given below.
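
```python
# A sketch of the sliding-window cropping. The 256-pixel stride is our
# inference: it is what makes a 2048 x 1536 image yield 7 x 5 = 35
# patches of 512 x 512, matching the count reported above.
import numpy as np

def crop_patches(img, patch=512, stride=256):
    h, w = img.shape[:2]                       # e.g. 1536 x 2048
    return np.stack([img[y:y + patch, x:x + patch]
                     for y in range(0, h - patch + 1, stride)
                     for x in range(0, w - patch + 1, stride)])
```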

The image patch sets were then divided into training and test sets at a 4:1 ratio, with the former used only for model training and the latter only for testing. Considering the small size of the dataset, we discarded the validation set and instead used the test set to monitor the training metrics; a test-to-training ratio between 10 and 30% is considered reasonable. The influence of the above operations on the datasets can be found in Table 1.

Finally, with the help of Southwest Jiaotong University (SWJTU) petrologists, the grain edges were meticulously labeled, and the label images were downscaled and cropped using the same process.

The generation of the max image

References [1, 2, 5, 9, 10] made use of max petrographic images since they provide more information on grain edges. Since no CNN model has been specifically created for grain edge detection, we used this type of image as input to conventional natural-image edge detection models to compare them with the proposed framework. The fused max image, the image patches at the seven polarization angles, and their corresponding ground truths are shown in Fig. 5. A sketch of the fusion is given below.
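
```python
# A sketch of max-image fusion as we understand it from [1, 2, 5, 9, 10]:
# every pixel takes the RGB value from whichever polarization angle makes
# it brightest. The exact fusion rule used in those works may differ.
import numpy as np

def max_image(stack):                  # stack: (7, H, W, 3), one image per angle
    brightness = stack.mean(axis=-1)   # per-angle intensity, shape (7, H, W)
    idx = brightness.argmax(axis=0)    # brightest angle index per pixel
    h, w = idx.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    return stack[idx, rows, cols]      # (H, W, 3) fused max image
```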

Fig. 5 Graphical representations of the CPPID dataset. From top to bottom, each row represents quartz sandstone, micritic limestone, coarse-grained quartz sandstone, magnetite quartzite, and quartzite, respectively. a–g 0\(^{\circ }\), 15\(^{\circ }\), 30\(^{\circ }\), 45\(^{\circ }\), 60\(^{\circ }\), 75\(^{\circ }\), and 90\(^{\circ }\) polarized light image patches; h fused max image; i ground truth

Implementation details

PyTorch was used for the implementation. Using the Adam optimizer with a batch size of 4, the model converged after 120 epochs. Note that the relatively small batch size of 4 was dictated by limited GPU resources; a larger batch size may yield better accuracy and faster convergence, but the learning rate then needs to be adjusted accordingly. The initial learning rate was 1e−6 and was adjusted using a warm-restart cosine annealing schedule [38], with a minimum learning rate of 1e−9 and parameters \(T_{0}\) and \(T_\textrm{mult}\) set to 3 and 2, respectively. The official EfficientNetV2 learning rate is 1e−5; during hyperparameter search we inherited it first and then tuned it lower. We also tried other learning rate schedules, including CosineAnnealingLR and MultiStepLR, which resulted in lower model indicators after the hyperparameter search. To counteract overfitting, we adopted several data augmentation operations on the input and its corresponding label map, including random horizontal, vertical, and mirror flips, as well as random zooming and rotation. A minimal training-loop sketch is given below.
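
```python
# A minimal training-loop sketch matching the reported settings (Adam,
# batch size 4, initial LR 1e-6, warm-restart cosine annealing with
# T_0 = 3, T_mult = 2, floor 1e-9). `ECPN`, `dmpc_loss` and `train_loader`
# stand for the model and loss sketched above and an assumed DataLoader;
# none of this is the authors' released code.
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = ECPN().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=3, T_mult=2, eta_min=1e-9)

for epoch in range(120):
    for step, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = dmpc_loss(model(x.cuda()), y.cuda())
        loss.backward()
        optimizer.step()
        # fractional-epoch stepping restarts the cosine schedule at
        # epoch boundaries T_0, T_0 + T_0 * T_mult, ...
        scheduler.step(epoch + step / len(train_loader))
```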

With an input image patch size of 21 \(\times \) 512 \(\times \) 512 \(\times \) 3, the training process took about a day on an RTX 3080 GPU. ImageNet pre-trained weights from the timm package initialized the modified EfficientNetV2, while the rest of the model was randomly initialized. For each comparison model, the official version pre-trained on BSDS500 or BIPED was utilized and then fine-tuned on our dataset. Except for the learning rate, all hyperparameters are consistent with the original papers, since convergence on our proposed CPPID dataset generally requires a lower learning rate.

Postprocessing

The predicted segmentation maps of the image patches were merged directly to form the overall segmentation result, with a size of 2048 \(\times \) 1536 \(\times \) 3. Note that since the training and validation datasets were divided randomly, the complete segmentation results consist of both training and validation image patches. Considering the layout and length of the article, the post-processed overall segmentation results are displayed at https://github.com/ELOESZHANG/ECPN.

Experiments and results

Quantitative results: evaluation indicators outperform other state-of-the-art methods

For comparison with the proposed model, we chose several classical models specifically designed for edge detection in natural images, which employ a single image as input. Since the input to our model is a set of petrographic images obtained under various angles of cross-polarized light, our multi-image input is unsuitable for traditional edge detection models. Therefore, the max image generated from several cross-polarised images, which has been widely used in previous papers, was employed to train the comparison models. To ensure the validity of the comparison experiments, ECPN-Max displays the results of our model trained only on the max image.

The results of the comparison with state-of-the-art methods are summarized in Table 2. ECPN-Max outperforms the classical models also trained on the max image, demonstrating the effectiveness of our method. Besides, the full version of our proposed model (ECPN-Multi-MAEC-MEPN-DMPC) ranks best on most of the indexes, except for the R-value. The PR curves comparing these methods are presented in Fig. 6a, where our proposed model obtains the best result.

Fig. 6 Precision–recall curves on the CPPID dataset. a PR curves of our model and some competitors; b PR curves of our model in the different ablation study versions

Table 2 Evaluation indicators for the proposed ECPN compared with the state-of-the-art methods

Another vital factor for evaluating models is execution time. All models were evaluated during testing on the same computer with an Nvidia RTX 3080. Note that the execution times in Table 2 were averaged over 50 validation sets of image patches, because individual timings are affected by random noise and are therefore not unique. From Table 2, the XDoG method surpasses the five learning-based methods and one traditional image processing approach (SE), achieving an FPS of 32.78. Our proposed model obtains an FPS of 13.33, which is acceptable given that ECPN generates far more accurate edge prediction maps than the other models.

Quantitative results: evaluation indicators per category and analysis of model generalization ability

To compare the generalization ability of the model across different kinds of mineral thin-section images, we selected four typical indicators (F1-score, mean IoU, precision, and recall) for analysis. Table 3 displays the corresponding metrics measured for each rock category.

We can observe from Table 3 that the indicators for quartzite are higher by around 0.1 compared with the other four mineral thin sections, indicating that it is easier for the model to segment than the other four classes. Besides, the fact that the indicators do not vary significantly between categories further shows that the model is not over-fitted to a particular category. In summary, the model generalizes effectively across all categories and predicts a precise edge map without preference.

Table 3 Evaluation indicators per category for the ECPN edge predictions
Fig. 7 Comparison of grain edge segmentation results on the CPPID dataset. 0\(^{\circ }\), 30\(^{\circ }\), and 60\(^{\circ }\) denote the polarization angles of the image patches; these three angles are selected for display owing to page limits

Graphic results

Qualitative results are presented in Fig. 7, which compares the predicted grain edge maps with those of competing models. Our segmentation results clearly outperform those obtained by other methods, thanks to their excellent performance in avoiding edge blurring and under-segmentation in difficult-to-segment regions (see row 1 in Fig. 7). Furthermore, ECPN produces reliable segmentation results while avoiding the over-segmentation of impurities within grains (see row 2 in Fig. 7). Moreover, ECPN predicts finer edges than the others (see row 3 in Fig. 7), demonstrating the proposed model’s good convergence.

Ultimately, our proposed model demonstrates its benefits by avoiding three challenging issues of edge detection in petrographic images: over-segmentation of grain impurities, under-segmentation of grain edges, and generation of overly thick edges.

Ablation study I: deep supervision trick impairs training efficiency

Deep supervision, a trick commonly used in edge detection papers [20, 22,23,24,25,26,27,28,29], refers to calculating the loss function using both the side-output predictions and the last output prediction. Rows 1 and 2 in Table 4 show the performance of models trained with and without deep supervision, where it is clear that deep supervision impedes model performance; for this reason, we abandon it in our final model. The PR curve of this test is illustrated in Fig. 6b. Meanwhile, we observe empirically that deep supervision also impairs the efficiency of the training process.

Table 4 Ablation study on a model trained with or without deep supervision

Ablation study II: the multi-angle extinction consistency (MAEC) block enhances model performance

The proposed multi-angle extinction consistency (MAEC) block is designed to capture the grain extinction consistency of sequential cross-polarized thin-section petrographic images with different polarization angles, and then fuse each pixel with its four neighboring pixels to generate an edge-enhanced feature. Because the MAEC block has a negligible effect on the number of model parameters, we designed the ablation study to evaluate its performance by simply using or discarding it. Figure 6b depicts the PR curve for this experiment, and Table 5 summarizes the ablation results. Rows 1 and 2 in Table 5 show the performance of models trained with and without the MAEC block, from which we conclude that the MAEC block is beneficial to model performance, increasing the mean IoU by around 2.6\(\%\).

Table 5 Ablation study on model with or without MAEC

Besides, Table 6 compares the proposed MAEC block with related works. We identify three different approaches for fusing multi-angle polarized images of rock thin sections. Earlier works [3, 4, 6,7,8, 11] performed edge detection on single images and fused the results during a post-processing stage, which is time-consuming and unsatisfactory. Other related works [1, 2, 5, 9, 10] used the max-image technique to fuse the multi-angle input images, which takes comparatively less time but involves more manual design, so the results are also subpar. In contrast to previous methods, the proposed MAEC employs convolutional neural layers to implicitly learn the relationship between images taken at different polarization angles, utilizing the extinction consistency prior to draw together the features of pixels belonging to the same mineral particles, thereby enhancing and fusing the input features.

Table 6 Comparison of the proposed MAEC block with existing methods for the operation of multi-angle polarized images

Ablation study III: the BiDecoder proves the effectiveness of bi-direction feature flow

To demonstrate the effectiveness of the proposed BiDecoder, we vary how deep and shallow layers are linked. We design four types of decoders in this part (see Fig. 8 for details): the general decoder, the up-to-down decoder, the bottom-to-up decoder, and the BiDecoder. As the names imply, the BiDecoder uses bidirectional feature propagation, while the up-to-down and bottom-to-up decoders allow features to flow only from shallow to deep layers or from deep to shallow layers, respectively. The general decoder is constructed without any such connections.

Fig. 8 Four types of decoder for the ablation study. a General decoder; b up-to-down decoder; c bottom-to-up decoder; d BiDecoder

As shown in Table 7, unsurprisingly, the general decoder has the worst results (55.0% mIoU), demonstrating the necessity and rationality of multi-level feature fusion. The up-to-down and bottom-to-up decoders rank third and second, achieving 62.7% and 76.2% mIoU respectively, indicating that features flowing from shallow to deep layers are more beneficial for information fusion than features flowing the other way. Finally, the BiDecoder outperforms the other architectures, ranking first (80.3% mIoU) among the four decoders, demonstrating how the bidirectional feature flow enhances the decoder’s capacity for feature aggregation. We therefore conclude that the BiDecoder strengthens model performance and that the bidirectional feature propagation architecture is effective.

Table 7 Ablation study on the design of decoder
Table 8 Ablation study on the loss function

Ablation study IV: the distance map penalized compound (DMPC) loss function strengthens the perception of edges

The class-balanced cross-entropy loss function \(l_{1}\left( X,W \right) \) is commonly used [20, 22,23,24,25,26,27,28] because edge detection is a severely class-imbalanced task. Instead of only penalizing the dissimilarity between two statistical distributions over the whole picture, we propose the distance map penalized loss function \(l_{2}\left( X,W \right) \), which guides the model using the mismatches in the overlap regions between the ground truth and the predicted segmentation map. The distance map penalized compound (DMPC) loss function \(l\left( X,W \right) \) couples \(l_{1}\left( X,W \right) \) and \(l_{2}\left( X,W \right) \) to reduce training instability. The effectiveness of the proposed DMPC loss function is shown in Table 8, where it increases the edge detection results by 0.5%. Besides, the PR curve of this experiment is shown in Fig. 6b.

Conclusion and future work

An extinction consistency perception network (ECPN) for grain edge detection in cross-polarised petrographic images is presented in this paper. The proposed multi-angle extinction consistency (MAEC) module enables the model to capture the extinction properties of mineral grains in petrographic images and fuse the inputs into edge-enhanced features. To obtain a multi-scale representation of edge features at different levels, we utilize a multi-scale edge perception network (MEPN) composed of a modified EfficientNetV2 and our proposed BiDecoder. Finally, we introduce a distance map penalized compound (DMPC) loss function to guide the model to pay more attention to edges during training. Besides, a cross-polarized petrographic image dataset called CPPID has been collected and made available to the public. In comparison with several classical edge detection models on CPPID, our method achieves 0.940 ODS and 0.941 OIS, outperforming the other models by approximately 15%.

Some issues remain, even though the proposed model generates precise rock grain edge prediction maps in terms of both subjective image results and objective metrics. With the current CPPID dataset, the model is limited to a few kinds of petrographic images for edge detection. Besides, given the relatively small dataset, the predicted edge maps tend to overfit a few unreasonable edge labels, i.e., overfitting introduces dataset bias. Furthermore, the MAEC module, developed for multi-polarized petrographic image fusion, was designed only to observe the consistency of multiple cross-polarized petrographic images, which is relatively simple.

Future research could take into account the crystalline nature of different mineral grains and introduce multi-modal information fusion between plane- and cross-polarised petrographic images, which may further facilitate the accurate prediction of grain edges. It is hoped that more publicly available datasets containing a wider variety of petrographic images will become available to eliminate bias and improve model generalisability. Furthermore, exploring the model’s zero-shot ability on out-of-domain data may be a novel direction for addressing the scarcity of petrographic datasets.