
1 Introduction and Problem Formulation

Clustering methods, which group data sets into clusters without imposing the number of clusters in advance [16, 20], are widely used in many applications, from natural speech recognition to biology and medicine (e.g. [13]). In order to obtain meaningful clustering results, aside from the choice of the specific clustering algorithm, the following issues are of fundamental importance:

  1. the choice of a suitable data representation,

  2. the definition of an adequate similarity measure.

The problem of suitable data representation has been recognized in the literature, see e.g. [6]. The choice of data representation for the investigated data set has a considerable impact on the ability to achieve satisfactory results. Studies on the impact of shape representation on clustering architectural plans have been conducted, e.g., in [17].

1.1 The Problem

Our contribution is devoted to clustering 2D objects with respect to shape and size. We consider two types of shape representations: contours and silhouettes. For contour representations we use similarity measures based on a combination of DTW [1, 4, 15] and PA [5, 9, 10], as discussed in [12]. For silhouette representations we use a number of binary distances, e.g. Hamming, Vari, and Rogers-Tanimoto [18], as well as the Pattern-difference and Size-difference distances [3].

In the case of contour representation, the overall similarity measures are defined as combinations of size and shape similarity measures, as proposed in [12] (see Sect. 3), i.e. the clustering is performed simultaneously with respect to size and shape. In the case of silhouette representation, we start with clustering with respect to size and, within the obtained (size) clusters, clustering is then performed with respect to shape (Sect. 2).

1.2 Motivation and Contribution

The aim of the present contribution is twofold. First, using a binary matrix representation of the investigated 2D shapes, we propose a clustering algorithm based on a number of discrete similarity measures combined with the Affinity Propagation algorithm. Secondly, we investigate the impact of the choice of shape representation and similarity measures on the results of clustering 2D cross-sections of rotationally symmetric 3D objects, by comparing the binary matrix approach with the contour-based approach proposed in [12].

Both approaches, the silhouette-based approach proposed in Sect. 2 and the contour-based approach proposed in [12] and recalled briefly in Sect. 3, can be applied to the clustering of general 2D objects. Our motivation comes from digital cultural heritage and is related to clustering archaeological ceramic vessels with respect to size and shape. Due to technological principles, ceramic vessels are rotationally symmetric, so the clustering can be limited to their 2D cross-sections (see Fig. 1). Papers devoted to this topic are scarce in the literature, e.g. [8, 11, 21].

The experiments are performed on real-life archaeological data representing cross-sections of ceramic vessels. Our input data are black and white images of a unified resolution. When clustering archaeological ceramic data one can hardly specify the number of clusters in advance, so we are bound to use algorithms which do not require the number of clusters to be predefined.

The representations and the similarity measures are discussed in Sect. 2 and Sect. 3. The results of experiments for archaeological ceramic data sets and the discussion are presented in Sect. 4.

Fig. 1. The transformation of a section of a rotationally symmetric object into input data.

2 Silhouette-based Approach

2.1 Data Representation

In this approach we represent shapes as matrices of pixels, to be clustered as binary matrices or, in some cases, as binary vectors. The process of preparing the input data consists of two steps.

The first step relies on the analysis of the size of the original input data. The size analysis is performed by the K-Means clustering algorithm: in order to find a suitable division of the data set, it groups objects of similar size into clusters. In the case of our experimental data, the typical outcome of the K-Means algorithm is one or two clusters of objects of similar size. This step is performed by an application written in Python 3.6. Its purpose is to retain the information about the size of the investigated objects, since in the second step we perform a normalization which leads to the loss of size information. The size information is recalled in the last step, before the final clustering.
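As an illustration, this size pre-clustering step could be sketched as follows (a minimal sketch assuming NumPy and scikit-learn; the bounding-box size descriptor and the fixed number of size clusters are our assumptions, as they are not specified above):

```python
# Sketch of Step 1 (size pre-clustering), assuming NumPy and scikit-learn.
# The bounding-box size descriptor and the number of size clusters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def object_size(image):
    """Return (height, width) of the black silhouette in a 0/1 pixel matrix."""
    rows, cols = np.nonzero(image)
    return [rows.max() - rows.min() + 1, cols.max() - cols.min() + 1]

def cluster_by_size(images, n_size_clusters=2):
    """Assign each image a size-cluster label using K-Means on size descriptors."""
    sizes = np.array([object_size(img) for img in images])
    km = KMeans(n_clusters=n_size_clusters, n_init=10, random_state=0)
    return km.fit_predict(sizes)
```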

The second step is related to data transformation. We start by rescaling every object to a chosen fixed size, which is related to the resolution of the provided objects. The next step is to create a matrix representation of each object. The rows and columns of the matrix correspond to the pixels of the object. Since every image consists of black and white pixels only, the matrix of pixels is created as follows: element (i, j) equals 1 when the corresponding pixel is black and 0 when it is white. This matrix represents the object. This step is performed by an application written in Python 3.6.
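A sketch of this transformation (assuming Pillow and NumPy; the target resolution and the binarization threshold are illustrative parameters):

```python
# Sketch of Step 2 (rescaling and binary matrix creation), assuming Pillow and NumPy.
import numpy as np
from PIL import Image

def to_binary_matrix(path, target_size=(128, 128), threshold=128):
    """Load a black-and-white image, rescale it to a fixed size, and return
    a 0/1 matrix in which 1 marks a black pixel and 0 a white pixel."""
    img = Image.open(path).convert("L").resize(target_size)
    pixels = np.asarray(img)
    return (pixels < threshold).astype(np.uint8)
```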

After data preparation, we perform clustering based on similarity measures dedicated to binary data [3]. In this process we recall the size information (the division obtained by K-Means), so that it is included in the final clustering result. The next step, i.e. clustering with respect to shape, is then performed independently in each of the clusters obtained by K-Means (clustering with respect to size), i.e. in each size cluster separately.

2.2 Similarity Measure

Many measures have been proposed for computing the similarity/dissimilarity between two vectors. In the present investigation we are interested only in those dedicated to binary input data. We base our choice on the 76 binary similarity and dissimilarity measures recalled in [3]. Our selection of similarity measures for further analysis was guided by diversity. As a result, five measures were selected for further analysis. Let X, Y be binary matrices of the same size. Let \(x_{ij}\in X\), \(y_{ij}\in Y\) be the elements located at the same position (row and column) in the matrices X and Y, respectively. Let \(d_{00}\), \(d_{11}\), \(d_{01}\), \(d_{10}\) denote the numbers of positions at which the corresponding combinations of zeros and ones occur. Initially, every counter \(d_{xy}\) is set to 0. Next, for every pair of elements \(x_{ij}\) and \(y_{ij}\) the following operations are performed:

(i):

\(x_{ij}=0 \wedge y_{ij}=0\Rightarrow d_{00}=d_{00}+1\)

(ii):

\(x_{ij}=1 \wedge y_{ij}=1\Rightarrow d_{11}=d_{11}+1\)

(iii):

\(x_{ij}=0 \wedge y_{ij}=1\Rightarrow d_{01}=d_{01}+1\)

(iv):

\(x_{ij}=1 \wedge y_{ij}=0\Rightarrow d_{10}=d_{10}+1\)
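In a vectorized implementation (a sketch assuming NumPy and 0/1 matrices of equal shape), the four counters can be obtained without an explicit loop:

```python
# Sketch: counting d00, d11, d01, d10 for two 0/1 matrices of the same shape.
import numpy as np

def mismatch_counts(X, Y):
    X = np.asarray(X, dtype=bool)
    Y = np.asarray(Y, dtype=bool)
    d00 = int(np.sum(~X & ~Y))   # both pixels 0
    d11 = int(np.sum(X & Y))     # both pixels 1
    d01 = int(np.sum(~X & Y))    # 0 in X, 1 in Y
    d10 = int(np.sum(X & ~Y))    # 1 in X, 0 in Y
    return d00, d11, d01, d10
```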

In our clustering analysis we use the following similarity measures/distances:

  1. Hamming distance: \(M_{H}(X,Y)=d_{01}+d_{10}\),

  2. Vari distance: \(M_{V}(X,Y)=\frac{d_{01}+d_{10}}{4(d_{11}+d_{01}+d_{10}+d_{00})}\),

  3. Rogers-Tanimoto dissimilarity coefficient: \(M_{RT}(X,Y)=\frac{2(d_{01}+ d_{10})}{d_{11}+2(d_{01}+d_{10})+d_{00}}\),

  4. Pattern-difference distance: \(M_{PD}(X,Y)=\frac{4d_{01}d_{10}}{(d_{11}+d_{01}+d_{10}+d_{00})^{2}}\),

  5. Size-difference distance: \(M_{SD}(X,Y)=\frac{(d_{01}+d_{10})^{2}}{(d_{11}+d_{01}+d_{10}+d_{00})^{2}}\).
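The five distances then follow directly from their definitions (a sketch; `mismatch_counts` is the helper from the previous listing):

```python
# Sketch: the five binary distances computed from the counters above.
def binary_distances(X, Y):
    d00, d11, d01, d10 = mismatch_counts(X, Y)
    n = d00 + d11 + d01 + d10                     # total number of positions
    return {
        "hamming": d01 + d10,
        "vari": (d01 + d10) / (4 * n),
        "rogers_tanimoto": 2 * (d01 + d10) / (d11 + 2 * (d01 + d10) + d00),
        "pattern_difference": 4 * d01 * d10 / n ** 2,
        "size_difference": (d01 + d10) ** 2 / n ** 2,
    }
```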

Several characteristic features of the above similarity measures are observed:

(a):

The Hamming distance focuses only on the information about differences between objects. It can be successfully applied to Android malware detection [19].

(b):

The Vari distance and the Rogers-Tanimoto coefficient analyse the full spectrum of information on object similarities and differences. They have been applied in natural-science settings, e.g. to maize genetic similarity coefficients based on microsatellite markers or to diversity studies in olives (see e.g. [2, 18, 22] and the references therein).

(c):

The Pattern-difference and Size-difference distances are based on squared quantities: the former on the product \(4d_{01}d_{10}\) and the latter on the square \((d_{01}+d_{10})^{2}\), both normalized by the square of the total number of positions.

After calculating the similarity matrix, the effectiveness of each similarity measure was evaluated through clustering. The clustering is performed by the Affinity Propagation algorithm (see e.g. [7]), which does not require a predefined number of clusters; this is an essential feature for our archaeological application. This part of the clustering process is performed by an application written in Python 3.6.
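A sketch of this stage using scikit-learn's AffinityPropagation (an assumption on our part; the text only states that the code is written in Python 3.6). With affinity="precomputed", scikit-learn expects a similarity matrix in which larger values mean more similar, so the distances above are negated:

```python
# Sketch: shape clustering with Affinity Propagation on a precomputed
# similarity matrix; scikit-learn is an assumed dependency.
import numpy as np
from sklearn.cluster import AffinityPropagation

def shape_clusters(images, distance):
    """Cluster binary matrices by shape; `distance` is one of the measures above."""
    n = len(images)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # negate the distance so that larger values mean "more similar"
            S[i, j] = S[j, i] = -distance(images[i], images[j])
    return AffinityPropagation(affinity="precomputed", random_state=0).fit_predict(S)

# Sequential pipeline: size clusters first, then shape clusters within each:
# size_labels = cluster_by_size(images)
# for k in set(size_labels):
#     group = [img for img, lab in zip(images, size_labels) if lab == k]
#     labels = shape_clusters(group, lambda X, Y: binary_distances(X, Y)["hamming"])
```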

Below we list the main algorithms used in the analysis. The symbols used in the algorithms are:

  1. \(I_n[x_{I_n}, y_{I_n}]\) - original image \(I_n\) with dimensions: \(x_{I_n}\) - width, \(y_{I_n}\) - height

  2. \(R_n [x_{R_n}, y_{R_n}]\) - resized image \(R_n\) with dimensions: \(x_{R_n}\) - width, \(y_{R_n}\) - height

  3. \(B_n [x_{B_n}, y_{B_n}]\) - binary matrix \(B_n\) with dimensions: \(x_{B_n}\) - number of columns, \(y_{B_n}\) - number of rows

  4. N - number of images

  5. D - size difference threshold

  6. \(d_{00}\), \(d_{11}\), \(d_{01}\), \(d_{10}\) - numbers of instances in which the corresponding condition (described above) holds

Algorithms 1-5 (pseudocode listings).

An important feature of the approach proposed in Algorithm 1-Algorithm 5 is that the clustering analysis is performed sequentially: first with respect to size (K-Means algorithm), and next with respect to shape (Affinity Propagation algorithm).

3 Contour-based Approach

Below we recall the contour-based approach proposed in [12]. Contours (boundary discrete curves) are extracted from the cross-sections by standard contour extraction techniques and smoothed by Savitzky-Golay filtering. Hence, the objects to be clustered are discrete curves \(\alpha \), i.e. pairs of vectors,

$$\begin{aligned} \alpha :=(x(i),y(i)),\ i=1,...,k_{\alpha }. \end{aligned}$$
(1)

3.1 Composite DTW-PA Similarity Measures

By \(n\ge 2\) we denote the cardinality of the set of contours to be clustered. By i and j, \(i,j=1,...,n\), we denote the i-th and the j-th contour, respectively. We calculate “distances” between contours i and j, \(i\ne j\), by using formulas (2), (3), (4), which measure the degree of similarity between i and j,

$$\begin{aligned} \text {Procrustes measure: } pa(i,j):=\max \{PA(i,j), PA(j,i)\}, \end{aligned}$$
(2)

where PA(i, j) denotes the Procrustes measure between i and j obtained from the Matlab Statistics and Machine Learning Toolbox (for more details see [12]),

$$\begin{aligned} \text {direct composition measure: }dc(i,j):=\max \{DC(i,j), DC(j,i)\}, \end{aligned}$$
(3)

where \(DC(i,j)=DTW(i,Z_j)\), the DTW measure is calculated with the Matlab Signal Processing Toolbox, and \(Z_j\) is the optimal curve obtained from PA(i, j); the measure DC was introduced and analysed in [12],

$$\begin{aligned} \text {scale component measure: }\gamma (i,j):=1-\min \{\gamma ^*(i,j), \gamma ^*(j,i)\}, \end{aligned}$$
(4)

where \(\gamma ^*(i,j)\) is the optimal scaling factor obtained from PA(i, j) (see [12] for more details). We put \(pa(i,i)=0, \ dc(i,i)=0, \ \gamma (i,i)=0.\) To ensure comparability of the above-defined measures, we use the following normalization formulas for dc and \(\gamma \):

$$ ndc(j):=\frac{1}{\max _{i=1,...,n}{dc(i,j)}}, \ \ \ \ \text {for}\ \ \ j=1,...,n, $$
$$ n\gamma (j):=\frac{1}{\max _{i=1,...,n}{\gamma (i,j)}}, \ \ \ \ \text {for}\ \ \ j=1,...,n. $$

In the absence of normalization we put \(ndc(j)=ndc=1\) for all \(j=1,...,n\), or \(n\gamma (j)=n\gamma =1\) for all \(j=1,...,n\).

3.2 Similarity Matrix

Now we recall the similarity measure (5) and the similarity matrix (6), which were defined and investigated in [12]. Let \(\mu , \lambda , \omega \in \mathbb {R}\) be given numbers (weights). The similarity measure (SM) is given as follows:

$$\begin{aligned} SM_{ij}(\mu ,\lambda ,\omega ,ndc, n\gamma ):= \mu \cdot pa(i,j) +\lambda \cdot ndc(j) \cdot dc(i,j)+\omega \cdot n\gamma (j)\cdot \gamma (i,j). \end{aligned}$$
(5)

Depending on the choice of the parameters \(\mu , \lambda , \omega \in \mathbb {R}\), we obtain the basic measures \(pa(i,j), dc(i,j), \gamma (i,j)\), or combinations of two or three of them, with or without normalization. By calculating \(SM_{ij}\) according to (5) for every pair of contours i and j, \(i,j=1,...,n\), we get the similarity matrix

$$\begin{aligned} SM(\mu ,\lambda ,\omega ,ndc, n\gamma ), \end{aligned}$$
(6)

which is symmetric and has zeros on the main diagonal. Different choices of \(\mu ,\lambda ,\omega ,ndc, n\gamma \) lead to different similarity matrices and consequently, to different clustering outcomes. In particular, we obtain

  1. weighted Procrustes and scale component matrix

    $$\begin{aligned} WPSM=SM(\mu ,0,\omega ,1,1), \end{aligned}$$
    (7)

  2. weighted direct composition and scale component matrix, where the direct composition values are normalized

    $$\begin{aligned} WNDCSM=SM(0,\lambda ,\omega ,ndc,1), \end{aligned}$$
    (8)

  3. weighted direct composition and scale component matrix, where both the direct composition values and the scale component values are normalized

    $$\begin{aligned} WNDCNSM=SM(0,\lambda ,\omega ,ndc,n\gamma ). \end{aligned}$$
    (9)

We perform clustering on the basis of the similarity matrix (6) by the standard hierarchical algorithms (Matlab Statistics and Machine Learning Toolbox) and generate the dendrogram. The average linkage is used to measure the distance between clusters.
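For illustration, a sketch of assembling the matrix (6) from precomputed pairwise matrices pa, dc and \(\gamma \), followed by average-linkage clustering. The original experiments use Matlab toolboxes; SciPy/NumPy are substituted here as an assumption, and the matrices pa, dc, gamma are assumed to be given as n x n arrays:

```python
# Sketch: similarity matrix (6) and average-linkage clustering, assuming SciPy/NumPy.
# The pairwise matrices pa, dc, gamma are assumed to be precomputed (n x n arrays).
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def similarity_matrix(pa, dc, gamma, mu, lam, omega,
                      normalize_dc=True, normalize_gamma=False):
    # ndc(j) = 1 / max_i dc(i, j); column maxima, as in the normalization formulas above
    ndc = 1.0 / dc.max(axis=0) if normalize_dc else np.ones(dc.shape[1])
    ngm = 1.0 / gamma.max(axis=0) if normalize_gamma else np.ones(gamma.shape[1])
    # element (i, j): mu*pa(i,j) + lam*ndc(j)*dc(i,j) + omega*ngamma(j)*gamma(i,j)
    return mu * pa + lam * dc * ndc + omega * gamma * ngm

# SM = similarity_matrix(pa, dc, gamma, mu=1.0, lam=1.0, omega=1.0)
# The entries play the role of pairwise dissimilarities (zeros on the diagonal):
# Z = linkage(squareform(SM, checks=False), method="average")
# dendrogram(Z)
```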

An important feature of the contour-based approach is that the weights appearing in the definition of the similarity measure (5) allow for simultaneous clustering with respect to size and shape.

4 Discussion of the Results of the Experiment

Below we summarize the clustering results obtained with the help of the methods from Sect. 2 and Sect. 3. The experiment has been conducted on seven sets of real-life data compiled from archaeological material. One of the data sets is presented in Fig. 2. It is worth noting that there are no benchmark data sets related to clustering of archaeological ceramics.

Fig. 2. Set 1.

According to general evaluation criteria (e.g. [16]), clustering results are acceptable when the resulting clusters are well defined, i.e. the distances between elements within a cluster are small (intra-cluster characteristic), while the distances between clusters are large (inter-cluster characteristic). Due to the unlabeled nature of the investigated data, our evaluation is expert-based.

The percentage of correctly clustered elements is summarized in Table 1. Average scores have been calculated over all sets for all methods. For readability, best scores for both classes of methods (i.e. contour- and silhouette-based) are marked in bold.

Based on the results, we evaluate the performance of both approaches as acceptable, with average scores for each method between 78.68% and 86.63%.

Table 1. Percentage of correctly classified elements for Sets 1–7 according to expert evaluation. The number of elements in each set is given in brackets.

Some correlations between the characteristics of the data sets and the results obtained with the two approaches can be established. The situation is interesting, e.g., for Set 1: here the contour-based approach performs well, while the silhouette-based approach seems to be lacking. The opposite is true for Set 2, where the average score for the contour-based approach is 85%, while the silhouette-based approach has an average score of 92%. The same holds for Set 4, with scores of 86% and 94%, respectively. This points to a set of characteristics of the original data that have a strong impact on the results obtained.

Moreover, with few, set-specific exceptions, the choice of the similarity measure in the silhouette-based approach seems to have little impact on the resulting clusters. What is even more striking, the applied order of clustering, sequential (first size, then shape, in the silhouette-based approach) versus simultaneous (weighted size and shape in the contour-based approach), also does not seem to influence the clustering results in a decisive way. One should note, however, that the reported computational experiment is based on sets of medium size (30–40 elements). Some robustness of the results with respect to sequential versus simultaneous approaches could be implied by the fact that size and shape are features of completely different natures: size can easily be quantified, whereas the formal quantified expression of shape is still a topic of current research (https://www.dam.brown.edu/people/mumford/vision/shape.html, [14]). The characteristics impacting the results of the considered approaches are the subject of ongoing investigation.

5 Conclusion

In conclusion, both the contour-based and the silhouette-based approaches are viable for clustering 2D objects with respect to size and shape. Moreover, to the best of our knowledge, the silhouette-based approach has not yet been used in archaeological applications. In further research, the problem of computational experiments for large data sets should be addressed.