A data analysis procedure for phase identification in nanoindentation results of cementitious materials

Accurately measuring phase properties is essential for realistic mesoscale modeling of materials, and nanoindentation is a popular technique for measuring mechanical properties. Given the statistical nature of the grid indentation method, where large arrays of indents are performed blindly, the identification of phases from the distributions of measured properties is an essential step. Many biases can be introduced at that stage when the phases do not have very distinct properties, as is often the case for cementitious materials, since many indentation tests may also fall in effectively heterogeneous areas. It is proposed in the present work to analyze statistical indentation results on cementitious materials with a hierarchical clustering algorithm making use of enriched information, including the spatial coordinates of the indents. It is shown that this approach reduces potential biases of the method by eliminating tests in potentially heterogeneous areas and by performing a model-independent identification of the different phases.

Introduction
Nanoindentation provides an efficient method to probe the mechanical properties of materials at microscopic length scales. A force is applied onto the surface of a sample using an indenter of known geometry and properties, and the resulting force-depth response is analyzed through contact mechanics, most commonly to extract the local elastic modulus and hardness. In the context of cementitious materials, nanoindentation is widely used, for example, to study Interfacial Transition Zones [1,2], chemical degradation [3] or carbonation [4], or time-dependent (creep) properties [5-8]. By directly measuring the properties of the elementary phases [9-13], nanoindentation provides a direct experimental input to upscaling methods [14], with the aim of increasing their predictive capabilities in deriving effective mechanical properties.

The now standard phase identification procedure from nanoindentation data relies on the identification of a Gaussian Mixture Model (GMM) on the 2D distribution of indentation mechanical properties [15,16]. More descriptors, such as chemical information, can be added to these classification methods to identify phases more reliably: when coupling EDS and nanoindentation, atomic ratios can be combined with indentation parameters in the procedure [17-19]. For the standard approach, which makes use only of micromechanical measurements, some controversy has arisen about the capability of the statistical indentation method to determine properties of single phases in cement paste [20,21], since the fitting of a GMM presents many local minima [22]. More generally, it has notably been argued that in the nanoindentation testing of cementitious materials, probed volumes smaller than the heterogeneity length scale (possibly in pure phases) are too small relative to the minimal obtainable roughness given by the porosity (where adequate contact conditions are fulfilled).
In view of these issues, it is proposed in the present work to take advantage of the spatialized nature of the data provided by nanoindentation maps in order to reduce the biases commonly introduced in the post-processing stage of statistical nanoindentation. It is first argued that the standard analysis of nanoindentation data based on Gaussian mixture models inevitably introduces biases when the phases' mechanical properties partially overlap. A post-processing method making fuller use of the obtained experimental data is proposed: after the removal of indentation tests where a local homogeneity criterion is violated, a hierarchical clustering algorithm making use of enriched and spatialized data is

where $E_1$ is the indentation modulus of phase "1", $f$ its volume fraction in the probed volume and $\epsilon$ a measurement error. One may check that using the Reuss (uniform stress) bound does not yield significantly different conclusions for the range of parameters chosen here. This random variable is obtained from the random variables $E_1$, $E_2$ and $\epsilon$, which will be assumed to follow normal distributions, and $f$, which is assumed to follow a beta distribution, with probability density:

$$p(f) = \frac{f^{\alpha - 1} (1 - f)^{\beta - 1}}{B(\alpha, \beta)}$$

with $B$ the beta function, which is the proper normalization constant. Parameters $\alpha$ and $\beta$ define the shape of this distribution. The beta distribution allows for general modelling of a continuous random variable on [0,1] and is therefore well adapted to the statistical behavior of proportions. In our case, we assume that indents in nearly homogeneous areas are highly probable ($f$ very likely to be close to 0 or 1) and therefore $\alpha \ll 1$ and $\beta \ll 1$. Phase "1" is also assumed to be of higher volume fraction than phase "2" (hence, $f$ is more likely to be close to 1 than to 0) and therefore $\beta < \alpha$.
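The statistical model above can be sketched numerically. The snippet below is a minimal illustration and not the author's code: it assumes a Voigt (uniform strain) mixing rule, which the mention of the Reuss bound as the alternative suggests, and uses the standard beta parametrization. The function name `sample_moduli` and all numerical values are hypothetical placeholders, except the 28 GPa stiff-phase mean quoted in the text.

```python
import numpy as np

def sample_moduli(n, mean1=20.0, mean2=28.0, sd=2.0,
                  alpha=0.4, beta=0.2, noise_sd=0.5, seed=0):
    """Draw n synthetic indentation moduli (GPa) for a two-phase material.

    Assumed model (Voigt mixing): E = f*E1 + (1 - f)*E2 + eps, with
    E1, E2, eps normal and f beta-distributed. All default values are
    illustrative assumptions, not values from the paper (except mean2).
    """
    rng = np.random.default_rng(seed)
    e1 = rng.normal(mean1, sd, n)        # modulus of phase "1"
    e2 = rng.normal(mean2, sd, n)        # modulus of phase "2"
    f = rng.beta(alpha, beta, n)         # volume fraction of phase "1"
    eps = rng.normal(0.0, noise_sd, n)   # measurement error
    return f * e1 + (1.0 - f) * e2 + eps, f

moduli, fractions = sample_moduli(1000)
```

With both shape parameters below 1 the sampled fractions pile up near 0 and 1, which mimics indents falling mostly in nearly homogeneous areas; choosing the second parameter smaller than the first favors values close to 1.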
As a numerical application, we assume phases with mechanical properties similar to the two usually

Datasets with 1000 modulus measurements are then generated, as this appears empirically to be a reasonable amount of data to differentiate phases in cementitious materials. A two-component Gaussian mixture model is then fitted to this data with the usual EM algorithm, as can be observed for one dataset in Figure 2. One expects to recover the correct means of the $E_1$ and $E_2$ distributions as the two means of the Gaussian mixture components, at least in the limit of a large number of repeated experiments.

For 1000 experiments each containing 1000 indents, the fit of a Gaussian mixture model with two components is performed. The distribution of the means of these Gaussian components is represented in Figure 2: it can be observed in particular that these means generally underestimate the means of the original phase distributions; finding the correct average value for the stiff phase modulus actually has almost zero probability. The most likely value is around 25.4 GPa, to be compared to the expected 28 GPa; the estimated weight (volume fraction) of the soft phase is approximately 55%, as compared to the expected 67% (defined as the value of the cumulative distribution function of $f$ at 50%). However, one may check that constraining either the weights (phase volume fractions) or the means of the Gaussian components would yield correct values for the free parameters, which is characteristic of an underdetermined problem.
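A single run of this numerical experiment could look like the sketch below, which fits a two-component mixture with scikit-learn's `GaussianMixture` (an EM implementation). The synthetic data follow an assumed Voigt mixing rule with a beta-distributed volume fraction; the helper name `fit_two_component_gmm` and every numerical value other than the 28 GPa stiff-phase mean are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_two_component_gmm(moduli, seed=0):
    """Fit a 1D two-component GMM by EM; return (means, weights) sorted by mean."""
    data = np.asarray(moduli, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, n_init=5, random_state=seed)
    gmm.fit(data)
    order = np.argsort(gmm.means_.ravel())
    return gmm.means_.ravel()[order], gmm.weights_[order]

# One synthetic "experiment" of 1000 indents (illustrative parameters only).
rng = np.random.default_rng(1)
n = 1000
f = rng.beta(0.4, 0.2, n)                       # fraction of the soft phase
moduli = (f * rng.normal(20.0, 2.0, n)          # soft phase (assumed mean)
          + (1.0 - f) * rng.normal(28.0, 2.0, n)  # stiff phase, 28 GPa
          + rng.normal(0.0, 0.5, n))            # measurement error
means, weights = fit_two_component_gmm(moduli)
```

Repeating the last lines over many seeds and collecting `means` and `weights` gives the kind of distribution of fitted parameters that the text compares with the generating values.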
In conclusion, with fairly reasonable assumptions about the statistics of nanoindentation results (two phases, separated with moderate overlap, moderate heterogeneity, low experimental errors), it can be observed that the results yielded by Gaussian mixture phase decomposition do not generally converge to the correct average values for the phase properties (both mechanical properties and phase fractions), even for an unrealistically large number of experiments. This type of method may converge to many local minima, as shown by [22], and the model parameters at the global minimum do not generally coincide with the parameters of the distribution that generated the sample. The potential bias of the method may be significantly high in the case where more than two phases partially overlap (for example in cementitious matrices with portlandite CH around 40 GPa and stiffer anhydrous phases) or have non-Gaussian distributions.

Alternative method: hierarchical clustering applied to nanoindentation
In view of the previous results, it is useful to attribute phases to each indent based on the properties of the force-displacement curves themselves instead of identifying phase probability distributions on the whole dataset. With no a priori knowledge about the mechanical properties of each phase, one may use unsupervised clustering algorithms, a class of algorithms aiming to automatically group data vectors into classes based on some similarity measure. Moreover, the usual decomposition method makes no use of the spatial correlations in the dataset that arise when the experiment is properly designed (indent spacing adequately small, that is, lower than the characteristic heterogeneity size). In particular, indents in heterogeneous areas can be detected as small-scale variations in measured mechanical properties, and spatially close areas with similar mechanical properties are likely to be from the same phase. The proposed algorithm attempts to take these remarks into account.

Each indent is described by a data vector whose components are $x$ and $y$, the spatial coordinates of the indent, $E$ and $H$, the usual indentation modulus and hardness, $h$, the residual depth, and the ratio of the elastic strain energy to the total strain energy, which is linked to the compared curvatures of the loading and unloading curves. This energy ratio is calculated as the ratio of the integrals of the force-displacement unloading branch and the loading branch. Although large (non-linear) correlations exist between these quantities, they are not entirely redundant: no data reduction procedure (such as Principal Component Analysis, for example) is attempted. Data remains of reasonable dimensionality, unlike datasets with many descriptors given for example by Acoustic Emission signals [30].
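As an illustration of the energy ratio defined above, the following sketch integrates the loading and unloading branches of a force-displacement curve with the trapezoidal rule. The function names and the choice of integration rule are assumptions made for the example; the paper does not specify how the integrals are evaluated.

```python
import numpy as np

def branch_area(depth, force):
    """Trapezoidal integral of force over depth for one branch of the curve."""
    depth = np.asarray(depth, dtype=float)
    force = np.asarray(force, dtype=float)
    return float(np.sum(0.5 * (force[1:] + force[:-1]) * np.diff(depth)))

def elastic_energy_ratio(h_load, p_load, h_unload, p_unload):
    """Ratio of the unloading-branch integral to the loading-branch integral."""
    total = branch_area(h_load, p_load)
    # Depth decreases along the unloading branch, so the raw integral is negative.
    elastic = abs(branch_area(h_unload, p_unload))
    return elastic / total

# Synthetic curve: quadratic loading to depth 1, unloading to residual depth 0.5.
h_up = np.linspace(0.0, 1.0, 1001)
p_up = h_up ** 2
h_down = np.linspace(1.0, 0.5, 1001)
p_down = ((h_down - 0.5) / 0.5) ** 2
ratio = elastic_energy_ratio(h_up, p_up, h_down, p_down)
```

Because only depth differences enter the integrals, adding a constant offset to both depth arrays leaves the ratio unchanged.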
Moreover, using a dataset enriched relative to the usual $[E, H]$ is expected to make the method more robust when dealing with noisy or imperfectly corrected data: the ratio of elastic to total strain energy is notably invariant under a shift in displacement and insensitive to small errors in zero-point correction.

The local relative error of the indentation hardness is computed as the ratio of the standard deviation to the mean of hardness values in a neighborhood of the indent (called $\mathcal{N}(x, y)$):

$$\delta_H(x, y) = \frac{\sigma_{\mathcal{N}(x, y)}(H)}{\mu_{\mathcal{N}(x, y)}(H)}$$

where the neighborhood is chosen here as the indent itself and its four nearest neighbors. This parameter characterizes the local variability in mechanical properties. The physical size of the neighborhood should be smaller than the characteristic heterogeneity length scale. In order to eliminate probably heterogeneous regions, indents are eliminated from the analysis if $\delta_H(x, y)$ is higher than some arbitrarily defined value; in the present work, indents must verify $\delta_H(x, y) < 25\%$.

Finally, since the components are of different magnitudes and are to be compared, each measured quantity is normalized such that the whole dataset has mean 0 and variance 1 for each component.
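The homogeneity filter and the normalization step might be implemented as in the sketch below, which assumes that the hardness values are stored on a regular 2D grid and that the population standard deviation is used (the text does not say which estimator is meant). The function names are hypothetical. Indents on the outer border lack one of their four neighbors and are therefore left out of the returned array.

```python
import numpy as np

def local_relative_error(hardness):
    """Std/mean of hardness over each interior indent and its 4 nearest neighbors.

    hardness: 2D array on a regular grid. Returns an array of shape
    (rows - 2, cols - 2) covering interior indents only.
    """
    h = np.asarray(hardness, dtype=float)
    stack = np.stack([
        h[1:-1, 1:-1],   # the indent itself
        h[:-2, 1:-1],    # neighbor above
        h[2:, 1:-1],     # neighbor below
        h[1:-1, :-2],    # neighbor to the left
        h[1:-1, 2:],     # neighbor to the right
    ])
    return stack.std(axis=0) / stack.mean(axis=0)

def standardize(features):
    """Scale each column of an (n_indents, n_features) array to mean 0, variance 1."""
    x = np.asarray(features, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Example: uniform hardness map with one outlier in the middle.
grid = np.ones((5, 5))
grid[2, 2] = 10.0
keep = local_relative_error(grid) < 0.25   # 25% threshold used in the text
```

In this toy map the outlier and its four neighbors are rejected, so only the four corner indents of the interior 3 x 3 block are kept.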

b. Clustering algorithm
The selected clustering algorithm is of the agglomerative hierarchical type [31]: starting with each observation in its own cluster ("singlet"), clusters are successively merged until the required final number of clusters is reached. At each step, merging is performed according to a criterion which aims at globally achieving maximal similarity of elements inside each cluster and maximal dissimilarity between clusters (or, geometrically, maximal intra-cluster compactness and inter-cluster separation). We use the weighted Euclidean distance as a similarity measure in the 6D space of data vectors:

$$d(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{6} w_i^2 \, (u_i - v_i)^2}$$

The weights $w_i$ are introduced in order to adjust the influence of the different measured quantities on the similarity measure. All mechanical parameters are treated equally, but spatial coordinates are chosen to have a reduced influence when selecting the clusters to merge. Therefore we choose in the present case $w_1^2 = w_2^2 < 1$, all other weights being kept equal to 1.

The selected algorithm is Ward's method [32]: at each merging step, one selects the pair of clusters to merge such that the increase in intra-cluster "inertia" is minimal, defined as:

$$I = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n_k} d\left(\mathbf{u}_i^{(k)}, \mathbf{g}_k\right)^2$$

where we have at the current step $N$ observations in $K$ clusters, each with $n_k$ elements $\mathbf{u}_i^{(k)}$ and center of gravity $\mathbf{g}_k$ (as defined with the aforementioned distance). For a given dataset and distance, the final cluster hierarchy is unique: it does not depend on any initialization of the algorithm or on the number of clusters to be found. The final number of clusters in the dataset (where to "stop" the merging process) must however be selected or deduced from some clustering quality evaluation parameter (such as [33]).
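A minimal sketch of this clustering step with scikit-learn's `AgglomerativeClustering` is given below. It relies on the fact that multiplying each standardized column by its weight and then using the plain Euclidean distance is equivalent to a weighted Euclidean distance with squared weights. The function name `ward_clusters`, the column order (spatial coordinates first) and the synthetic data are assumptions for illustration; the squared spatial weight of 0.2 is the value reported in the text.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def ward_clusters(features, n_clusters, spatial_weight_sq=0.2):
    """Ward clustering of (n_indents, n_features) data vectors.

    Columns 0 and 1 are assumed to hold the spatial coordinates; they get a
    squared weight `spatial_weight_sq`, all other columns a weight of 1.
    """
    x = np.asarray(features, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)   # mean 0, variance 1 per column
    w = np.ones(z.shape[1])
    w[:2] = np.sqrt(spatial_weight_sq)         # scale columns by w_i
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return model.fit_predict(z * w)

# Synthetic example: two spatially compact phases with distinct properties.
rng = np.random.default_rng(0)
n = 30
xy_a = np.column_stack([rng.uniform(0.0, 0.5, n), rng.uniform(0.0, 1.0, n)])
xy_b = np.column_stack([rng.uniform(0.5, 1.0, n), rng.uniform(0.0, 1.0, n)])
mech_a = rng.normal(0.0, 0.5, (n, 4))
mech_b = rng.normal(10.0, 0.5, (n, 4))
data = np.vstack([np.hstack([xy_a, mech_a]), np.hstack([xy_b, mech_b])])
labels = ward_clusters(data, n_clusters=2)
```

Since the merge hierarchy does not depend on the requested number of clusters, the same call can be repeated with different `n_clusters` values to compare partitions.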
The interest of the method is that it takes into account the fact that indents in the same phase should be similar in mechanical properties but also spatially close; in other words, two distant points must exhibit very similar mechanical properties to be assigned to the same cluster.

Figure 6 and numerical parameters in Table 1.

It can be observed for this sample that phases are not clearly separated in the data histograms and that issues of the type presented in 2.b may be expected, as a large number of models may yield fits as adequate as the global optimum presented here. In particular, one may check that using a four-component Gaussian mixture yields optima with largely overlapping components of comparable weights, which constitutes an even worse decomposition of the material phases' properties.

The proposed method is then applied to the analysis of this dataset using empirically defined weights $w_1^2 = w_2^2 = 0.2$, also implemented using the Python module scikit-learn [37]. In order to calculate the local variability criterion at each point, the outer indents are excluded from the analysis and therefore the studied dataset makes use of 900 indents; moreover, the criterion excludes 181 points from the subsequent steps (20% of the dataset). The 719 remaining indents are classified using the hierarchical algorithm described above, and results for four clusters are represented in Figure 7. The obtained per-phase property distributions are "irregular" and asymmetric and cannot be accurately fitted with Gaussian distributions.
The clustering results for four clusters show that areas excluded from the analysis are mostly of "weaker" mechanical properties that may correspond to roughness due to local porosity and/or damage, with sizes consistent with the porosity sizes observed in the electron microscopy images (a few micrometers, Figure 3); they also include isolated indents with high mechanical properties, which may be attributed to minor hydration products or anhydrous phases. The procedure necessarily yields well-separated clusters that optimize the prescribed criterion of separation in the space of mechanical properties, mixed with some preference for local spatial compactness. The properties of the four phases are given in Table 2.

The size of the studied example area being relatively small, the representativity of the sample may be too weak for definitive results, especially regarding the volume fractions. At the studied high water-to-cement ratio, the need to separate very high porosity areas from the commonly defined "outer product (OP)" and "inner product (IP)" C-S-H is confirmed [28], although most of them are already eliminated when considering our criterion of local homogeneity. These very high porosity areas are clustered in

The result for three clusters is reported in Table 3. The determined clusters possess mechanical properties consistent with those numbered 1 to 3 in cement paste (Table 2), which were attributed respectively to OP C-S-H, IP C-S-H and CH (portlandite), without the need to separate very porous areas in the analysis, consistently with the observations of [3]. This application confirms the main interest of the method: phases are unambiguously classified even when no obvious separation exists in the (E, H) histograms.

Conclusions
A classification method based on a classical hierarchical clustering algorithm has been applied to nanoindentation results, using enriched information: additional mechanical parameters and the spatial position of the indent are added to the usual indentation hardness and modulus. This method has been shown to:

1. Allow an identification of phases on nanoindentation results that is model independent (no need for Gaussian property distributions or widely separated phases), and identify each point unambiguously (instead of deriving a probability distribution for each phase),
2. Allow the elimination of points in areas that may be largely porous or heterogeneous.

As a perspective, this method can be extended in a straightforward manner to "multichannel" maps obtained using other experimental techniques. This will require the definition of suitable components of the data vector and an updated distance function. It may include compositional information such as elemental ratios (SEM-EDS), density/average atomic number (SEM-backscattered imaging gray level) or molecular composition and structure (Raman microspectroscopy), which would allow an increased reliability of the phase separation as well as the establishment of correlations between the locally measured quantities. If the properties of Gaussian Mixture Models are desirable (smooth distributions of properties), the hierarchical clustering method presented here may also be used to initialize or constrain fit parameters (such as the means) when many local minima are expected.
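The last suggestion, using cluster assignments to initialize a Gaussian mixture fit, could be sketched as follows with the `means_init` and `weights_init` parameters of scikit-learn's `GaussianMixture`. The helper name `gmm_from_labels` and the synthetic (E, H) values are hypothetical; the labels stand in for the output of any hierarchical clustering step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_from_labels(props, labels, seed=0):
    """Fit a GMM whose initial means and weights come from cluster labels.

    props:  (n_indents, n_properties) array, e.g. columns (E, H).
    labels: cluster index of each indent from a prior clustering step.
    """
    x = np.asarray(props, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    means = np.array([x[labels == k].mean(axis=0) for k in ids])
    weights = np.array([(labels == k).mean() for k in ids])
    gmm = GaussianMixture(n_components=len(ids), means_init=means,
                          weights_init=weights, random_state=seed)
    return gmm.fit(x)

# Illustrative data: two phases in the (E, H) plane with placeholder values.
rng = np.random.default_rng(0)
props = np.vstack([rng.normal([20.0, 0.6], [2.0, 0.1], (50, 2)),
                   rng.normal([30.0, 1.0], [2.0, 0.1], (50, 2))])
labels = np.repeat([0, 1], 50)
gmm = gmm_from_labels(props, labels)
```

This only sets the starting point of the EM iterations; constraining the means during the fit, as also mentioned in the text, would need a different approach.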

Acknowledgements
This work has been carried out in the framework of the CEA-EDF-Areva agreement. The author thanks S. Poyet (CEA) for discussions regarding the manuscript.