1 Introduction and Motivation

From individuals seeking information to machine learning systems such as chatbots and trading algorithms, countless entities rely on the ocean of data known as the World Wide Web. In this landscape, website monitoring plays a crucial role. By analyzing constantly changing internet data, monitoring tools handle diverse tasks ranging from event detection and price tracking to ensuring compliance with evolving policies. One noteworthy example is the European Union’s 2016 Directive on website accessibility for public services (Directive (EU) 2016/2102). Monitoring tools can help ensure that such regulations are upheld, promoting an inclusive digital space for all.

Given the significance of website monitoring, such tools are vital for understanding the broader landscape of digital transformation. We seek to analyze websites of local governments across Europe with the end goal of assessing their digitalization. While the assessment itself is not part of this paper, we lay the foundation for an ongoing interdisciplinary research project between computer science and political science called Digilog (see footnote 1). There is already work claiming to measure the level of digital transformation within local governments. Garcia-Sanchez et al. [9] present an analysis of the development of e-government in 102 Spanish municipalities, and Pina et al. [22] conduct an empirical study of the effect of e-government on transparency, openness and hence accountability, covering 15 EU countries and a total of 318 government websites. When analyzing websites over time, changes such as a new domain or the emergence of new websites frequently occur. To maintain an up-to-date list of municipality URLs, we propose a method to verify websites’ authenticity, in particular by distinguishing between governmental and tourism sites. Our research reveals that crawlers often mistake tourism sites for government ones. This classification task, as well as all other downstream tasks (e.g. e-service detection, analysis of digital transformation, ...), requires a numerical representation of the website. However, accurately representing websites using numerical embeddings derived from Natural Language Processing and Computer Vision models is challenging. We evaluate the performance of pre-trained model embeddings in two scenarios: using them directly, and training a feedforward neural network (FNN) on top of them with domain-specific data, an approach known as transfer learning (TL). This approach enhances generalizability for various downstream tasks.

2 Related Work

Websites use both visual (images, rendered HTML code) and textual (running text) features to present content to users. To extract the full depth of information, an embedding model should be capable of processing both visual and textual data. It is therefore not surprising that Large Language Models (LLMs) and Convolutional Neural Networks (CNNs) are often used in recent work. There are also classical machine learning approaches that rely more heavily on feature engineering. However, they do not generalize as well as state-of-the-art models because they are less flexible when the structure of an HTML page changes. A large body of related work exists on text-only embedding and classification of websites. Hashemi [11], Kowsari et al. [13] and Minaee et al. [19] provide surveys of past work and discuss different approaches to website embedding. We mention only a selection of recent work related to our application. Visual-only classification is in many cases applied to the detection of harmful content. Whether the target is terrorist propaganda [12], alcohol, adult content and weapons [1, 7], or simply food, fashion and landscapes [17], the classes all have distinctive visual features. However, in many cases these approaches cannot distinguish visually similar pages (e.g. a municipality homepage vs. a tourism page about the same municipality).

In the field of text-based website embedding and classification, some approaches rely on classical machine learning [2, 18]. However, the majority are based on neural networks or Transformers. There are several RNN- and LSTM-based approaches [4, 15, 20, 24] to embedding websites. Lin et al. [15] and Zhou et al. [24] additionally combine their BiLSTM approach with a CNN. There are also various Transformer-based approaches [5, 10, 14, 21]. We summarize the two most relevant approaches for our topic for each group (textual-/visual-based). Li et al. [14] propose MarkupLM, a pre-trained LLM for document understanding tasks that uses both the actual text and the markup language. The model is based on the BERT architecture. They add an XPath embedding to the embedding layer, which is based on various features that identify the target leaf. The pre-training objectives are Masked Markup Language Modeling (prediction of the text tokens of DOM tree leaves), Node Relation Prediction (e.g., child, sibling, ...), and Title-Page Matching. They compare their two models (base and large) with previous models such as FreeDOM-Full [15], SimpDOM [24], and others on the SWDE dataset in terms of F1-score. They also compare their models with the BERT base, RoBERTa base, and ELECTRA large models from Chen et al. [5] on the WebSRC dataset. Their large model outperforms every other model in every aspect, while their base model outperforms the others in most cases. The proposed models are available only in English. Nandanwar and Choudhary [20] propose a web page classification model based on GloVe and a BiLSTM. They test the model on the WebKB dataset as well as the DMOZ dataset. They further compare their model against the ensemble model of Gupta and Bhatia [10] and a Support Vector Machine web page classification approach [2]. In most cases, the proposed model outperforms the other models.

There is also research on mixed approaches. Bruni and Bianchi [3] introduce a procedure for website classification that leverages both textual and visual features. They compare different classification algorithms for identifying e-commerce services on web pages. The classification approach they propose is highly sophisticated and may not align with our specific needs, as it assumes that classes have certain attributes such as those related to e-commerce services. Lugeon et al. [16] propose a language-agnostic website embedding for classification tasks. Their homepage embedder “Homepage2vec” creates a multilingual embedding based on word embeddings of the textual content (the first 100 sentences), the metadata tags (title, description, keywords, ...), the visual appearance (screenshot), and other features such as the domain name. The numerical features are concatenated and processed by a neural network. They are then used to classify the website into 14 different classes (art, business, computers, games, ...). While the feature embeddings seem to capture the essence of the homepage effectively, the model is constrained by a narrow range of broad classes.

3 Methods

In this section, we clarify which pre-trained models we used for embedding, how we applied TL, and how we evaluated the models’ embeddings.

As mentioned in Sect. 2, there are three main approaches to embedding websites: textual, visual, and combined methods. We apply two recently published methods and benchmark them on our dataset, which is described in Sect. 3.1. We selected Homepage2vec [16] and MarkupLM [14] due to their performance and reproducibility. Both approaches leverage the deeper semantic understanding embedded within markup documents. MarkupLM incorporates embeddings of XPath and tags as features, while Homepage2vec integrates visual features alongside specific data from a markup document, including keywords and descriptions found in the meta tag section. Notably, the authors of both models provide a library or GitHub repository for applying them. To compensate for the absence of a multilingual version of MarkupLM, text components were translated into English before being embedded.

Homepage2vec. We used the Homepage2vec [16] library (see footnote 2) and its ready-made feature extractors. We slightly changed the way Homepage2vec retrieves websites: we allow redirects when fetching with Requests, and if Requests cannot fetch a site, we fall back to Selenium with a headless Chrome web driver. The library offers two options: either include visual embeddings based on screenshots of the websites or leave them out. It concatenates all the individual features and processes them using fully connected layers. We obtained 100-dimensional embeddings by accessing the last hidden layer.
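The retrieval fallback described above can be sketched as follows. This is a minimal illustration and not the modified Homepage2vec code: the function names `fetch_with_requests`, `fetch_with_selenium`, and `fetch_html`, the timeout value, and the choice to treat HTTP error codes as failures are all assumptions made for the example.

```python
from typing import Callable, Optional


def fetch_with_requests(url: str, timeout: float = 10.0) -> str:
    """Fetch HTML with Requests, following redirects."""
    import requests  # third-party, imported lazily so the sketch loads without it

    response = requests.get(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()  # assumption: HTTP errors count as a failed fetch
    return response.text


def fetch_with_selenium(url: str) -> str:
    """Fetch the rendered HTML with a headless Chrome web driver."""
    from selenium import webdriver  # third-party, imported lazily
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def fetch_html(
    url: str,
    primary: Callable[[str], str] = fetch_with_requests,
    fallback: Callable[[str], str] = fetch_with_selenium,
) -> Optional[str]:
    """Try the primary fetcher first; if it fails, try the fallback."""
    try:
        return primary(url)
    except Exception:
        try:
            return fallback(url)
        except Exception:
            return None  # both fetchers failed
```

Passing the two fetchers as arguments keeps the fallback logic independent of the network, so it can be exercised with stand-in functions.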

MarkupLM. We used the MarkupLM [14] base model (see footnote 3) and large model (see footnote 4) to extract the text and XPath from the HTML. We limited the number of nodes to 512, which is the model’s maximum processing capacity. We then translated each text node using the LibreTranslate API (see footnote 5). We used the MarkupLM model to embed each node and took the mean over all nodes of each HTML document to obtain the document embedding. The final embedding has a dimension of 768.
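The pooling step, averaging the node embeddings of one HTML document into a single vector, might look like the sketch below. It assumes the per-node vectors have already been produced by the model. The helper name `mean_pool` is hypothetical, and the node limit is applied here only for illustration, whereas the text applies it before embedding.

```python
from typing import List, Sequence

MAX_NODES = 512  # node limit stated in the text


def mean_pool(
    node_embeddings: Sequence[Sequence[float]],
    max_nodes: int = MAX_NODES,
) -> List[float]:
    """Average per-node vectors into one document-level vector."""
    nodes = node_embeddings[:max_nodes]  # keep at most max_nodes nodes
    if not nodes:
        raise ValueError("at least one node embedding is required")
    dim = len(nodes[0])
    if any(len(vec) != dim for vec in nodes):
        raise ValueError("all node embeddings must have the same dimension")
    # Component-wise mean over the retained nodes.
    return [sum(vec[i] for vec in nodes) / len(nodes) for i in range(dim)]
```

With 768-dimensional node vectors, the result is again a 768-dimensional vector regardless of how many nodes the page contains.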

Header Section Embedding. A website typically includes a header that reflects the site’s structure, including its main topics and subtopics. We extracted this header based on predefined rules, then extracted its text and embedded it with a multilingual BERT-based sentence embedder. The embedding has 768 dimensions.

ResNet Embedding. As a simple visual embedding method we used the pre-trained ResNet18 model for embedding screenshots of the websites. We retrieved the screenshot of each website and embedded it with this ResNet18 model. This resulted in an embedding vector with 512 dimensions.

Our TL approach involves using the models’ embeddings and training an FNN on top of them. The first hidden layer of the FNN transforms the embeddings into a 100-dimensional vector. The second layer is the classification head. This architecture was also used for the classification in the original training of Homepage2vec, and we adopted the same activation functions and dropout rates.

After the website contents are mapped by an encoder model into latent space, we can analyze how well the ground-truth labels (municipality vs. non-municipality) are naturally clustered. To evaluate the two clusters, we used the Silhouette Score [23], the Davies-Bouldin Index [6], and an additional score we call the Separation Distance General Score (SDG-score). In the TL case, we simply take the 100-dimensional vector from the hidden layer as the embedding.

The SDG-score evaluates each cluster individually and takes the mean over all cluster values. It assesses whether each cluster is separable from the rest of the dataset by comparing the third quartile (Q3) of the within-cluster distances to the cluster centroid with the first quartile (Q1) of the distances from observations outside the cluster to that centroid. A high outside distance (high separation) and a low within distance (compact cluster) result in a high score, which is preferable.

$$\begin{aligned} {S_{SDG}} = \frac{1}{C}\sum ^{C}_{k=1}{\frac{Q1(ocd(k))}{Q3(wcd(k))}} \end{aligned} \tag{1}$$

C represents the total number of clusters. The function wcd returns a vector of length \(J_{k}\) containing the distances of all observations within the k-th cluster to its centroid. The function ocd returns a vector containing the distances of all observations outside the k-th cluster to the centroid of the k-th cluster. This vector has length \(N-J_{k}\), where N is the number of all observations and \(J_{k}\) is the number of observations within the k-th cluster.
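To make the score concrete, the following sketch computes it for labeled points using only the Python standard library. It follows the prose description, with the outside-cluster Q1 divided by the within-cluster Q3 so that higher values indicate better separation. The use of Euclidean distance and of the inclusive quartile method are assumptions, and the function names are illustrative.

```python
import math
from statistics import quantiles
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def centroid(points: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a set of points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sdg_score(points: Sequence[Vector], labels: Sequence[Hashable]) -> float:
    """Mean over clusters of Q1(outside distances) / Q3(within distances)."""
    clusters: Dict[Hashable, List[Vector]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    ratios = []
    for label, members in clusters.items():
        center = centroid(members)
        # wcd: distances of cluster members to their own centroid.
        wcd = [math.dist(p, center) for p in members]
        # ocd: distances of all other observations to this centroid.
        ocd = [math.dist(p, center) for p, l in zip(points, labels) if l != label]
        q3_within = quantiles(wcd, n=4, method="inclusive")[2]
        q1_outside = quantiles(ocd, n=4, method="inclusive")[0]
        ratios.append(q1_outside / q3_within)  # undefined if the cluster has zero spread
    return sum(ratios) / len(ratios)


# Two well-separated 1-D clusters give a higher score than two interleaved ones.
separated = sdg_score([[0.0], [2.0], [10.0], [12.0]], [0, 0, 1, 1])
interleaved = sdg_score([[0.0], [2.0], [1.0], [3.0]], [0, 0, 1, 1])
```

Using quartiles rather than minimum and maximum distances makes the ratio less sensitive to a few outlying observations.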

3.1 Dataset and Infrastructure

Our dataset consists of 2901 municipality websites provided directly by the country administrations and an additional 1349 municipality websites hand-labeled by domain experts. After dropping duplicates, the remaining municipalities are used to retrieve 3813 non-municipality websites from DuckDuckGo by querying the municipality name + “tourism”. We use tourism websites since they share various characteristics with municipality websites and act as difficult negatives. The dataset contains websites of municipalities from ten different countries: Albania, Azerbaijan, Bulgaria, Croatia, Cyprus, Hungary, Romania, Serbia, Slovakia, and the United Kingdom. Only the landing page of each website was used for the embedding. We generally used the standard PyTorch parameters for all methods. During training, we applied early stopping on the validation loss with a patience of 10 epochs. To ensure the robustness of the trained model, we implemented stratified K-Fold Cross-Validation with a validation and a test set. This involved dividing our dataset into 10 folds, which were assigned to training, validation, and test sets in the proportions 60:20:20. Training and performance measurement were done on an 11th Gen Intel(R) Core(TM) i9-11950H CPU with 16 cores, 32 GB RAM, and an NVIDIA RTX A2000 Laptop GPU with 4 GB dedicated RAM.
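One way to read the split described above is sketched below: the data is dealt into 10 class-balanced folds, and in each of 10 rounds two folds serve as the test set, two as the validation set, and the remaining six as the training set. The rotation scheme, the seed, and the function names are assumptions made for illustration, as the text does not specify how folds are assigned to the three sets.

```python
import random
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def stratified_folds(
    labels: Sequence[Hashable], n_folds: int = 10, seed: int = 0
) -> List[List[int]]:
    """Deal the indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def rotating_splits(
    folds: List[List[int]],
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, val, test) index lists, rotating the fold roles."""
    n = len(folds)
    holdout = n // 5  # 2 of 10 folds each for test and validation (20% each)
    for start in range(n):
        order = [folds[(start + offset) % n] for offset in range(n)]
        test = [i for fold in order[:holdout] for i in fold]
        val = [i for fold in order[holdout:2 * holdout] for i in fold]
        train = [i for fold in order[2 * holdout:] for i in fold]
        yield train, val, test
```

Under this reading, every observation appears in a test set in exactly two of the ten rounds, and each set keeps roughly the class ratio of the full dataset.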

Table 1. Municipality vs. non-municipality website clustering scores for each embedding method, showing the score without (P) and with transfer learning (TL) using an FNN on top of the frozen pre-trained models. The mean processing time per observation (in seconds) is also given and includes scraping, feature engineering and embedding. The best score per column is written in bold.

4 Case Studies and Results

The scoring of the embeddings in Table 1 shows Homepage2vec with visual embedding to be the best model for embedding domain-specific data without further adaptation. The combination of visual and textual embeddings seems to have an advantage. That is reasonable since certain distinctive features are only detectable visually in the rendered page (e.g. municipalities tend to have a white background with an image of the municipality in the upper part of their websites). However, when only visual features are considered, the model is unable to form distinct clusters. This is due to its inability to capture the semantics of links and general text on the website. The light version of Homepage2vec without visual embedding performs worse than the other text-only approaches. The result of the ResNet embeddings, without further training, shows that it does not perform well in building clusters. The same is true for the pre-trained MarkupLM models and the header section embeddings. When using TL, the MarkupLM models as well as the header section embedding perform best and show the largest performance improvement. Both with and without TL, the ResNet-based model is not able to compete with the other models. The high performance of Homepage2vec with visual embeddings does not translate to high performance with TL. This discrepancy may arise from the model’s original training; crucially, however, the Homepage2vec embeddings are limited to 100 dimensions, while the embeddings from the other models exceed 500. A higher-dimensional embedding vector may explain why these models show a larger performance jump from pre-trained to TL. An alternative approach with Homepage2vec could involve extracting embeddings from a different layer of the model.

Figure 1 shows the separation of tourism websites from municipality websites. The embeddings with TL show less overlap of the clusters. The mean embedding time in seconds per URL is shown in Table 1. The time for the embedding includes fetching the HTML and screenshot if needed, followed by the embedding method. In the case of MarkupLM, the translation of the page content to English is also included in the time. The Header section embedding method performs best in terms of embedding time. The time correlates with the complexity of each model.

Fig. 1. Embeddings of municipality (green) and non-municipality websites colored by most frequent domain. Both approaches are capable of creating reasonable clusters. (Color figure online)

The TL results for discerning between municipality and non-municipality websites are shown in Table 2. We show the mean F1-score and precision alongside their respective margins of error. We prioritize precision in this context to minimize false positives. We use the standard error of the mean (SEM) to calculate the margin of error over the 10-fold cross-validation. The MarkupLM large embedding as input to the TL step resulted in the best F1-score and precision. There are different reasons why the MarkupLM models perform better than the Homepage2vec model in this task. As mentioned before, the MarkupLM embedder offers embeddings roughly 7 times larger, but it was also pre-trained differently, on a broader and bigger dataset of 24 million webpages, making it potentially easier to use for TL and more generalizable. Homepage2vec was trained on 886,000 webpages using multiple pre-trained models that were not trained on HTML content. The MarkupLM model was therefore exposed to much more HTML data during training than Homepage2vec. Furthermore, the MarkupLM training dataset, a Common Crawl snapshot, includes municipality websites, whereas the Homepage2vec dataset does not.
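As a small worked illustration of the reporting described above, the sketch below derives precision and F1 from confusion counts and summarizes per-fold scores as mean and SEM. The function names are hypothetical, and whether the reported margin of error multiplies the SEM by a critical value is not stated in the text, so the sketch returns the plain SEM.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def precision_f1(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and F1-score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


def mean_with_sem(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean of per-fold scores and the standard error of that mean."""
    # SEM = sample standard deviation / sqrt(number of folds)
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


# Example with made-up counts and scores, purely to show the arithmetic.
example_precision, example_f1 = precision_f1(tp=90, fp=10, fn=30)
example_mean, example_sem = mean_with_sem([1.0, 2.0, 3.0, 4.0])
```

Because false positives appear only in the precision denominator, prioritizing precision directly penalizes non-municipality sites that are labeled as municipal.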

Table 2. Transfer learning (TL) scores of the embedding methods using 10-fold cross-validation (in percentages, ± margin of error).

5 Conclusions and Future Work

We compared different models on their capability to embed highly diverse HTML documents and also tested their performance on classifying municipality websites in a binary classification task. We rated the embedding methods with several clustering performance scores, which reward the capability of separating websites within a binary classification system. We compared the performance of pre-trained models as well as models with TL. Based on the clustering scores, Homepage2vec with a combined approach of textual and visual features outperforms visual-only and text-only models on all applied clustering scores. When applying TL and comparing the outputs of the last hidden layer as embeddings, the markup language-based models showed the biggest improvement and outperformed both the mixed approach and the visual-only approach. In the binary classification task, MarkupLM also outperforms the other approaches with a precision of 99.36% and an F1-score of 99.18%. However, when considering processing time, which is an important factor for classifying large datasets, the baseline “Header section” model offers the best balance between time efficiency and classification performance, achieving competitive clustering scores with TL.

Further research could examine how different sections of a website contribute to its embedding or classification, with a focus on explainability. Text-based embeddings seem to be the best choice when it comes to TL for binary classification of websites. We could extend the embeddings to cover not only a single page but also subpages of the domain. Many features of government websites are not immediately visible at the top level but become apparent at deeper levels of crawling. A comparison of models that also consider linked pages could be conducted. Further research is needed to analyze the performance of models in a multi-class classification setting for website features such as e-forms, logins, payments, etc. To encourage the model to spread out the embeddings more effectively, one could apply triplet or contrastive learning approaches as seen in SimCSE [8]. This could be coupled with more sophisticated methods for handling outliers and edge cases. One approach could be to crawl potentially hard-to-classify web pages as part of a dataset augmentation strategy. When it comes to training classification models, labeled data is a valuable asset. Additional research could explore Semi-Supervised Learning and Active Learning in this specific context. The foundation of an efficient application is a reasonable embedding of a given website, which we have demonstrated is achievable.