A data analysis procedure for phase identification in nanoindentation results of cementitious materials

Measuring accurately phase properties is essential for a realistic mesoscale modeling of materials, and nanoindentation is a popular technique regarding mechanical properties. Given the statistical nature of the grid indentation method, where large arrays of indents are performed blindly, the identification of phases from the distributions of measured properties is an essential step. Many biases can be introduced at that stage when the phases do not have very distinct properties as is often the case for cementitious materials, since many indentation tests may also be in effectively heterogeneous areas. It is proposed in the present work to analyze statistical indentation results on cementitious materials with a hierarchical clustering algorithm making use of enriched information, including the spatial coordinates of the indent. It is shown that it allows to reduce potential biases of the method by eliminating tests in potentially heterogeneous areas and performing model independent identification of the different phases.


Introduction
Nanoindentation provides an efficient method to probe the mechanical properties of materials at microscopic length scales. By applying force using an indenter of known geometry and properties onto the surface of a sample, the obtained force-depth response is analyzed through contact mechanics in order to extract the local elastic modulus and hardness. In the context of cementitious materials, nanoindentation is being widely used, for example, to study Interfacial Transition Zones [1,2], chemical degradation [3], carbonation [4], or time-dependent (creep) properties [5][6][7][8]. By directly measuring the elementary phases properties [9][10][11][12][13] nanoindentation allows for a direct experimental input to upscaling methods [14] that aim to increase their predictive capabilities in deriving effective mechanical properties.
The currently standard phase identification procedure from nanoindentation data relies on the identification of a Gaussian Mixture Model (GMM) on the 2D distribution of indentation mechanical properties [15,16]. More descriptors such as chemical information can be added in these classification methods to identify phases more reliably, coupling quantitative SEM-EDS and nanoindentation, where one can use in the procedure atomic ratios combined with indentation parameters [17][18][19]. This method remains however complex in its implementation since quantitative analysis of the elemental composition using EDS has to be performed at the same locations where shallow indentation testing has been performed, with micrometer scale precision. For the standard approach making use only of micromechanical measurements, some controversy has arisen on the capabilities of the statistical indentation method to determine properties of single phases in cement paste [20,21], since the fitting of GMM presents many local minima [22]. More generally, it has been notably argued that for cementitious materials, probed volumes smaller than the heterogeneity length scale (and possibly in pure phases) are too small relatively to the minimal obtainable roughness given by the porosity. This hard limit is in the order of a few hundreds of nanometers (where adequate contact conditions are fulfilled), as shown in [23] (for a high quality surface preparation using a focused ion beam).
In view of these issues, it is proposed in the present work to take advantage of the spatialized nature of the data provided by nanoindentation maps in order to reduce these biases commonly introduced in the postprocessing stage of statistical nanoindentation. Indeed, nanoindentation maps at high resolution (tight indent spacing) allows to resolve details of the microstructure and adjacent indents have a probability to belong to the same phase higher than purely random sampling. It is first argued that the standard analysis of nanoindentation data based on Gaussian mixture models inevitably introduces biases when phases' mechanical properties do partially overlap. A post-processing method using more fully the obtained experimental data is proposed: after the removal of indentation tests where a local homogeneity criterion is violated, a hierarchical clustering algorithm making use of enriched and spatialized data is used. An application to indentation results on a high water-to-cement ratio cement paste is presented.

Analysis of load-displacement data
An instrumented indentation test consists in applying a force load path to the surface of a sample using an indenter with precisely characterized properties, and recording the corresponding displacement response of the indenter tip. This force-displacement data (assuming proper preliminary correction of the indenter's compliance and detection of the contact point) is then subject to an inverse analysis using contact mechanics, often with the Oliver and Pharr method [24]. The hardness H follows its usual definition: where F max is the peak load and A c the contact area under the indenter, obtained through f , the indenter area function with A c ¼ f h c ð Þ. The contact depth h c is obtained by correcting the measured depth h for the displacement of the contact perimeter (sink-in), given by the solution of Sneddon for the indenter elastic contact [25]: where S is the elastic unloading stiffness. The e coefficient takes values 0.72 for conical indenters and 0.75 for spheres at small depths. The contact stiffness S is the measured initial slope of the unloading forcedisplacement curve; it follows from the elastic modulus through the equation: with b a correction coefficient equal to 1 for axisymmetric indenters and E r the ''reduced'' modulus obtained from the indented material moduli E and m, and the indenter tip moduli E i and m i (usually diamond): Although sharp pyramidal indenters are widely used in this context, spherical indentation allows for primarily elastic contact for shallow indents [26] and is preferred here for to perform property mapping at high spatial resolution and with therefore small probed volumes. The scale dependency of the spherical indentation (due to the non-self-similarity of the geometry, which possesses an intrinsic length scale corresponding to the radius of curvature) also allows studying the amount of irreversible strain relatively to elasticity for a given loading path.

Gaussian mixture property decomposition and issues
Given the random nature of the probed areas under the indenter (heterogeneity, residual roughness) the statistical indentation technique is the usual way to analyze nanoindentation data on cementitious materials [15]. The technique essentially consists in fitting the mechanical properties experimental statistical distribution with a weighted sum of Gaussian functions, known as a Gaussian Mixture Model (GMM). This fit is a highly dimensional optimization problem (a two-component GMM in two-dimensions has 7 parameters) commonly solved using the Expectancy-Maximization (EM) algorithm [27]. One may consider a simple numerical experiment based on a simple statistical model for the indentation modulus on a two-phase material. Let us consider that the effective indentation modulus measured in a biphasic area is given by the Voigt (uniform strain) bound [28], to which is added some error term. For a uniform Poisson's ratio, it yields the simple rule of mixtures: where E 1 is the indentation modulus of phase ''1'', f its volume fraction in the probed volume and e a measurement error. One may check that using the Reuss (uniform stress) bound does not yield significantly different conclusions for the range of parameters chosen here. This random variable is obtained from the random variables E 1 , E 2 and e, which will assumed to follow a normal distribution and f which is assumed to follow a beta distribution, with probability density: with B the beta function which is the proper normalization constant. Parameters a and b define the shape of this distribution. The beta distribution allows for general modelling of a continuous random variable on 0; 1 ½ and is therefore well adapted to the statistical behavior of proportions. In our case, we assume that indents in nearly homogenous areas are highly probable (f very likely to be close to 0 or 1) and therefore a ( 1 and b ( 1. Phase ''1'' is also assumed to be of higher volume fraction than phase ''2'' (hence, f is more likely to be close to 1 than 0) and therefore a\b.
As a numerical application, we assume phases with mechanical properties similar to the two usually considered types of C-S-H [29,30] and chose E 1 and E 2 to follow a Gaussian distribution with means 18 and 28 GPa and standard deviation 4 GPa. This assumption matters since we assume phases have an intrinsic dispersion in mechanical properties and that the final statistical dispersion is only partially due to heterogeneous probed volumes. The error e is taken from a centered Gaussian distribution with standard deviation 1 GPa. Finally, coefficients for the beta distribution are taken arbitrarily as a ¼ 0; 017 and b ¼ 0; 035 which results in a probability of f to be in 0; 0:01 ½ of approximately 62% and 28% between 0:99; 1 ½ . Having 10% of indents in somewhat heterogeneous areas is voluntarily an extremely optimistic model that assumes that the experiment is properly designed [22].
Datasets with 1000 modulus measurements are then generated as they appear empirically to be a reasonable amount of data to differentiate phases in cementitious materials. A two-component Gaussian mixture model is then fitted to this data with the usual EM algorithm, as it can be observed for a dataset in Fig. 1. One expects to recover the correct means of the E 1 and E 2 distributions as the two means of the Gaussian mixture components, at least as a limit for a large number of repeated experiments.
For 1000 experiments each containing 1000 indents, the fit of a Gaussian mixture model with two components is performed. The distribution of the means of these Gaussian curves is represented in Fig. 1: it can be observed in particular that these means generally underestimate the means of the original phase distributions; finding the correct average value for the stiff phase modulus has actually almost zero probability. The most likely value is around 25.4 GPa, to be compared to the expected 28 GPa; the estimated weight (volume fraction) of the soft phase is approximately 55% as compared to the expected 67% (as defined as the value of the cumulative distribution function of f at 50%). However, one may check that constraining either the weights (phase volume fractions) or means of the Gaussian components would yield correct values for the free parameters, which is characteristic of an underdetermined problem. As a conclusion, with very optimistic assumptions about the statistics of nanoindentation results (two phases, separated with moderate overlap, moderate heterogeneity, low experimental errors), it can be observed that results yielded by Gaussian mixture phase decomposition do not generally converge to the correct average values for the phases properties (both mechanical properties and phase fractions), even for an unrealistic large number of experiments. This type of method may converge to many local minima as shown by [22], and the model parameters at the global minimum do not generally coincide with the parameters of the underlying distributions. The potential bias of the method in the case where more than two phases partially overlap (for example in cementitious matrices with portlandite CH around 40 GPa and stiffer anhydrous phases), or have non-Gaussian distributions, may be significantly high.

Alternative method: hierarchical clustering applied to nanoindentation
In view of the previous results, it is useful to attribute phases to each indent based on the properties of the force-displacement curves themselves instead of identifying phase probability distributions on the whole dataset. With no a priori knowledge about the mechanical properties of each phase, one may use unsupervised clustering algorithms, a class of algorithms aiming to regroup automatically data vectors into classes based on some similarity measure. Moreover, the usual decomposition method makes no use of spatial correlations of the dataset that arises when the experiment is properly designed (indent spacing adequately small-lower than the characteristic heterogeneity size). In particular, indents in heterogeneous areas can be detected as small scale variations in measured mechanical properties, and spatially close areas with similar mechanical properties are likely to be from the same phase. The proposed algorithm attempts to take into account these remarks.

Data preprocessing
Each indentation force-displacement curve is summarized as a vector that describes as completely as possible its properties with minimal data. To each point (''observation'') from an indent is associated the vector X: X ¼ x; y; E; H; h r ; R ½ where x; y ð Þ is the spatial position, E and H the usual indentation modulus and hardness, h r the residual depth and R the ratio of the elastic strain energy to the total strain energy, that is linked to the compared curvatures of the loading and unloading curves. R is calculated as the ratio of the integrals of the forcedisplacement unloading branch and the loading branch. Although large (non-linear) correlations exist between these quantities, they are not entirely redundant. Moreover, using an enriched dataset relatively to the usual E; H ½ is expected to provide additional robustness into the method when dealing with noisy or imperfectly corrected data: R is notably invariant through a shift in displacement and therefore insensitive to errors in zero-point correction.
The local relative error of the indentation hardness is computed as the ratio of the standard deviation S to the mean l of hardness values in a neighborhood (called H x ): where the neighborhood is chosen here as the indent itself and its four nearest neighbors. This parameter characterizes the local variability in mechanical properties. The physical size of the neighborhood should be smaller than the characteristic heterogeneity length scale. In order to eliminate probable heterogeneous regions, indents are eliminated from the analysis if s H ð Þ is higher than some arbitrarily defined value. Finally, since X components are of different magnitudes and are to be compared, each measured quantity is normalized in a robust way such that each variable has identical 25% and 75% percentiles.

Data reduction and decomposition
Although data remains of reasonable dimensionality, unlike datasets with many descriptors resulting for example of Acoustic Emission signals [31], large nonlinear correlations between the chosen descriptors are observed and might tend to give excessive importance to some underlying features of the dataset. Several methods are classically used to perform dimensionality reduction and build uncorrelated variables such as Principal Component Analysis (PCA), but are based on the assumption that the initial variables are linearly correlated. It is proposed to apply here the ISOMAP algorithm [32], of the family of manifold learning algorithms [33]. The objective of this generic dimensionality reduction method is to discover the underlying non-linear degrees of freedom in the dataset while preserving the intrinsic geometry of data. In practice, it consists of three steps: first, determining the neighborhood of each point by constructing a graph of data points weighted with distances; second, computing the distances between pairs of points-as distances of the shortest path in the graph-and finally building a lower representation of the data from this distance matrix. This method is used here to obtain from the four nanoindentation parameters E; H; h r ; R two independent descriptors, V 1 ; V 2 , of the nanoindentation data (of decreasing magnitude), reducing the data vector to: These new variables are left unscaled to preserve the relative ponderation of descriptors.

Clustering algorithm
The selected clustering algorithm is of agglomerative hierarchical type [34]: with initially each observation belonging to its own cluster (''singlet''), clusters are successively merged until the final number of required clusters is reached. At each step, merging is performed according to some criterion which aims at globally achieving maximal similarity of elements inside each cluster and maximal dissimilarity between clusters (or, geometrically, maximal intra-cluster compacity and inter-cluster separation). We use the weighted Euclidean distance as a similarity measure in the now 4D space of data vectors: The weights w i are introduced in order to adjust the influence of the different measured quantities on the similarity measure. Spatial coordinates will be chosen to have a reduced influence when selecting the clusters to merge. The selected algorithm is Ward's method [35]; at each merging step, one selects the couple of clusters to merge such that the increase in intra-cluster ''inertia'' is minimal, defined as: where we have at the current step N observations in k clusters with each N i elements, and center of gravity v i (such as defined with the aforementioned distance). For a given dataset and distance, the final cluster hierarchy is unique: it does not depend on any initialization of the algorithm or on the number of clusters to be found. The final number of clusters k (where to ''stop'' the merging process) in the dataset must however be selected or deduced from some clustering quality evaluation parameter (such as [36]). The interest of the method is that one takes into account the fact that indents in the same phases should be similar in mechanical properties but also spatially close; using different wording, two distant points, to be affected to the same cluster, have to exhibit very similar mechanical properties; the practical implementation of this compromise however requires the addition of the weights as adjustable parameters.

Materials and methods
The characterized material was a CEM I cement paste with water-to-cement ratio of 0.52 (see, for example, [37]) (Fig. 2). A small sample was extracted from a large mature sample and was mounted in epoxy resin, grinded and polished with a final step during 1 h in an alcohol based diamond suspension with average grain size 0.25 lm, in order to reach adequate surface quality. The dataset on cement paste was obtained from a 32 9 32 indentation grid of spacing 2 lm acquired with a 10 lm radius spherical diamond indenter and a Zwick-Roell ZHN/SEM Ò (Ulm, Germany) nanoindenter in the laboratory conditions to avoid vacuum related issues (this SEM compatible device is used here at air in a vibration dampening chamber), with maximum force 2 mN. The quality of these points was assessed by detecting tests with large displacement jumps characteristic of surface damage, or failing the different curve fitting procedures. In the presented datasets, no such issues were met due to the smooth indenter geometry. Moreover, due to the tight spacing of the indentation grids, interaction between adjacent indents is likely in soft porous regions and should be taken with caution. In the following cases it is estimated that less than 6% of indents are carried out on partially overlapping areas, though it should mainly be of concern for largely porous areas, of limited interest. Finally, the use of spherical indentation is motivated at small indent spacing by the reported lower plastic deformation and damage in brittle materials [38,39]; however, direct comparisons of hardness results between pyramidal and spherical indenters is not straightforward, even at identical loadings since it depends on the material properties and type of behavior [40]. The 2 lm spacing is assumed to be able to resolve pure phase areas of the material, in particular in the light of latest quantitative 3D observations of cement pastes' phases by ptychographic X-ray tomography [41,42]. This technique has been used to show the 3D arrangement of single phase agglomerates in cement paste with resolution of order 100 lm.

Results on cement paste
The resulting maps of the indentation modulus and hardness are presented in Fig. 3, with residual depth and elastic-to-total strain energy ratio strongly correlated to them. The resulting mapping of a 64 lm 9 64 lm area shows relatively homogeneous subdomains of approximately 10-20 lm that can visually be attributed to 4 different phases of distinct mechanical properties. The probed area may be too small to constitute a representative sample of the cement paste mechanical properties, especially in terms of phase volume fractions, although microstructure data yields comparable orders of magnitude [43,44]. The elasticto-total strain energy ratio (R) map allows to distinguish phases with properties around the values 0.3, 0.5, 0.65 and 0.8. The standard analysis of statistical nanoindentation has been applied to this 1024 point dataset and a Gaussian mixture model with three components fitted (with full covariance matrix) to the E; H ð Þ results using the implementation of Python module scikit-learn [45]. The distribution components and E; H ð Þdata are presented in Fig. 4 and numerical parameters in Table 1.
It can be observed for this sample that phases are not clearly separated in the data histograms and that issues of the type presented in 2.b may be expected, as a large number of models may yield fits as adequate as the global optimum presented here. In particular, one may check that using a four component Gaussian mixture yields optima with components of comparable weights and largely overlapping, which constitutes an even worse decomposition of the material phase's properties.
The maps of the V 1 and V 2 variables are presented in Fig. 3 and can be seen to be much less correlated Fig. 3 Maps of the indentation modulus, indentation hardness, and of the two resulting descriptors provided by the ISOMAP algorithm, on a cement paste sample using spherical nanoindentation than the initial descriptors. The proposed method is then applied to the analysis of this dataset using empirically defined weights w 2 ð Þ is the standard deviation of the first nanoindentation descriptor; the other weights are fixed to 1. This algorithm is implemented from the Python module scikit-learn [45]. In order to calculate in each point the local variability criterion, chosen to be s H; x ð Þ\20%, the outer indents are excluded from the analysis and therefore the studied dataset makes use of 900 indents; moreover the criterion excludes 344 points from the subsequent steps (38% of the dataset). The remaining indents are classified using the hierarchical algorithm described above and results for the k ¼ 4 cluster number are represented in Fig. 5.
The obtained per phase property distributions are ''irregular'' and unsymmetrical and cannot be accurately fitted with Gaussian distributions. The clustering results for k ¼ 4 shows that areas excluded from the analysis are mostly of ''weaker'' mechanical properties that may correspond to roughness due to local porosity and/or damage, as demonstrated for example for surface defects in [46]. Their sizes are consistent with the porosity sizes observed in the electron microscopy images (few micrometers, Fig. 2). Also excluded are isolated indents with high mechanical properties, which may be attributed to minor hydration products or anhydrous phases. The procedure yields necessarily well separated clusters that optimize the prescribed criterion of separation in the space of mechanical properties mixed with some preference for local space compacity. The properties of the four phases are given in Table 2.
At the studied high water-to-cement ratio, the need to separate very high porosity areas from the bulk of the cementitious matrix is confirmed [29] although most of it is already eliminated when considering our criterion of local homogeneity. These very high porosity areas are clustered in Cluster 0. Cluster 1, 2 and 3 may therefore be classically attributed to respectively C-S-H (in majority), mixed C-S-H and CH, and mostly homogeneous portlandite (CH), with consistent mechanical properties with those reported in the literature [29,30]. The large anisotropy in elastic modulus of CH both predicted theoretically [47] and measured experimentally [46] could play a role in the large overlap between these values and those of C-S-H, since one can assume that we sample a random crystal agglomerate [41], with possible defects. Therefore the separation is mostly carried out by the other variables. The size of the studied example area being relatively small, the representativity of the sample may be too small for definitive results. Moreover, the given weights are not supposed to be equal to volume fractions: areas studied here are of arguable representativity and many points with  heterogeneous microvolumes are removed by the procedure, explaining discrepancies with the expected phase assemblage (for example [43]). Finally, it is interesting to demonstrate that the added variables h r and R add non-redundant information to the independent variables constructed using the ISOMAP algorithm. To this end, the histograms for the second variable V 2 when using only E and H mechanical data and for the full four-variable dataset are shown in Fig. 6. It can be observed that using more variables allows to change the distribution to almost Gaussian-like to a more complex distribution that can be assumed to result from the mixing of differentiated phases. Maps of the variable V 2 can also  be shown to have a greater information content and demonstrate the interest of these additional variables.

Conclusions
A classification method based on a classical hierarchical clustering algorithm has been applied to nanoindentation results, using enriched information: to the usual indentation hardness and modulus is added additional mechanical parameters-that are synthetized using a data reduction algorithm-and spatial position of the indent. This method has been shown to: 1. Allow an identification of phases on nanoindentation results that is independent of a selected statistical distribution (no need for Gaussian properties distribution or widely separated phases), identify each point unambiguously in the case of micrometer scale heterogeneity, or at least providing an efficient classification (instead of deriving a probability distribution for each phase), 2. Allow eliminating points in areas that may be largely porous or heterogeneous. 3. As a perspective, this type of method can be extended in a straightforward manner to ''multichannel'' or ''hyperspectral'' maps obtained using other experimental techniques. It will require the definition of suitable components to the data vector and an updated distance function. It may include qualitative or quantitative compositional information such as elemental ratios (SEM-EDS), density/average atomic number (SEM-backscattered imaging gray level) or structure (Raman microspectroscopy) that would allow an increased reliability of the phase separation as well as establishing correlations between the local measured quantities. If the properties of Gaussian Mixture Models are desirable (smooth distributions of properties) the hierarchical clustering method presented here may also be used to initialize or constrain fit parameters (such as the means) when many local minima are expected.