1 Introduction and Background

In geography and the spatial humanities, research using unstructured natural language texts as a starting point for detailed inquiries has received increasing attention. Natural language texts have been identified as rich data sources in numerous spatial disciplines, ranging from disaster management [1, 2] and epidemiology [3, 4] to landscape perception research [5,6,7], and have been found to frequently contain implicit or explicit spatial information. Commonly, corpora of natural language serve as underlying datasets, from which relevant information is extracted using Natural Language Processing (NLP) and human annotation. An important limitation of such research is the availability of corpora relevant to specific research questions, also referred to as domain specific corpora. In contemporary research, different approaches in terms of underlying data sources are common:

  1. Using very large general corpora (e.g. Wikipedia [8], Reddit [9])

  2. Using more specific corpora with respect to domain and genre (e.g. Geograph [10]; micro-blogs and news articles [11])

  3. Using social media data (e.g. Twitter [2], Hikr [12]), often collected using simple keyword search or other parameters (e.g. with respect to a location)

  4. Generating corpora (e.g. TellUsWhere [13]; Window Expeditions [14]) for specific tasks

In this paper we leverage a small, high quality corpus of natural language landscape descriptions to identify similar domain specific documents in other corpora. Identifying relevant documents is a common topic of interest in information retrieval [cf. 15] and within this paper we address identifying documents for specific research questions. Building corpora with the identified documents is crucial in various natural language based studies, ranging from natural language generation systems [16] and improving computational identification of irony and sarcasm [17] to term disambiguation [18]. However, approaches to building corpora by extracting relevant documents from existing text sources remain scarce in the domain of landscape perception and preference research specifically, and with respect to answering geographic questions more generally. This limits the possibilities for natural language based inquiries into how landscapes are perceived and valued.

How we perceive landscapes has been found to influence our well-being and guide our behaviour [19, 20]. As such, compiling landscape specific corpora of natural language could further our understandings of how we perceive and interact with our surroundings, potentially guiding future policies and decisions. As with corpus based research in general, one of the most limiting factors in landscape perception and preference research is the availability of high quality datasets about our surroundings. Although sensor based datasets (e.g. temperature, humidity, spectral reflectance) are relatively easy to collect and are thus plentiful, experiential and perceptual datasets describing more qualitatively how individuals interact with and value landscapes remain scarce and are more difficult to collect [cf. 21].

Studies indicate the connection between natural language and landscape perception runs deep [22, 23] making natural language a valuable source of perceptual and experiential information about our environment [5, 6, 23, 24]. Analysing natural language has become commonplace in landscape perception and preference studies, ranging from identifying salient landscape characteristics through crowdsourced image descriptions in the Lake District [5], over identifying most prominent environmental features using alpine year books [12] and extracting important landscape dimensions from short stories [24], to exploring fictive motion in landscapes in hiking blogs [25]. Many of these approaches rely on making unstructured text machine readable, either through qualitative coding and human annotation or by means of natural language processing [26, 27]. However, they rely on having corpora available which contain relevant documents - the focus of this paper lies on effectively identifying such documents in large corpora.

The internet hosts an abundance of unstructured natural language, potentially containing landscape relevant information [cf. 21]. However, non-specific corpora commonly contain many irrelevant documents which need to be identified and removed. In order to ensure the landscape relevance of the underlying natural language data, moderating content achieves the best results. However, moderation is only feasible in rather small collections of natural language due to the time needed to assess each document [28]. Moderating participatory data generation is thus limited to efforts resulting in small high quality datasets. The question thus arises: how can equally relevant documents in large collections of unstructured natural language be computationally identified? This requires that we numerically capture notions of document content, and one effective approach to approximating these semantics is the use of vector representations [29, 30], which also allow for computational comparisons using cosine similarity scores [6, 29, 31]. Cosine similarity is a frequently used metric of similarity between multidimensional vectors and has been used in various natural language processing tasks ranging from clustering biomedical articles [32] to gauging landscape perception for small corpora [6]. Text vectorisation can take many forms, from simple binary vectors recording presence or absence of terms, through Term Frequency - Inverse Document Frequency (TF*IDF) weightings, to more recent approaches using Word2Vec [33] and GloVe [34], which do not rely only on term matching. A further recent addition to vectorisation methods is Bidirectional Encoder Representations from Transformers (BERT), which marks a significant improvement over previous approaches [35]. However, using BERT to translate a high quality actively crowdsourced landscape relevant corpus to machine readable vectors and identifying similar documents in large web corpora through cosine similarity calculations has not been attempted. Having such a workflow would allow for the creation of larger datasets relevant for landscape perception and preference research.

The limitations mentioned above, in combination with contemporary natural language and computational linguistics methods, have led to the formulation of the following research questions:

  • RQ1: How can we use a small collection of high quality natural language landscape descriptions to identify relevant documents in large collections of text through natural language processing?

  • RQ2: How do the identified documents compare in terms of quality and landscape relevance?

2 Data

To collect an initial high quality dataset, we developed and implemented an active crowdsourcing platform named Window Expeditions [cf. 36]. The platform allows interested individuals to upload representative in-situ natural language descriptions of their everyday lived landscapes in three languages, English, German and French. The contributed landscape descriptions are moderated to ensure a resulting high quality dataset. Within this study, we use the first 427 English natural language landscape descriptions contributed to Window Expeditions as our initial dataset.

Since we are interested in identifying similar documents in large corpora of unstructured text, especially in web content, we included two very different additional text collections (cf. Table 1). On the one hand, we chose Geograph [cf. 10], an actively crowdsourced collection containing millions of representative landscape images and descriptions. On the other hand, we included a corpus from a completely different domain, WikiHow [cf. 37], containing common questions and answers on a variety of topics. We assumed that Geograph was much more likely to contain descriptions relevant to our task and included WikiHow as a control containing relatively short, often informally written conversational texts.

Table 1 Corpora and number of contained documents

3 Methods

Fig. 1 Overview of the implemented methods

In order to identify landscape relevant natural language in large collections of unstructured text, we propose a workflow based on methods of annotation, natural language processing and sentence-transformers. All methods are combined into a semi-automated and scalable workflow (Figure 1), the specifics of which are presented below.

Our starting point was a small collection of curated actively crowdsourced natural language landscape descriptions from the active crowdsourcing platform Window Expeditions as well as a corpus of landscape image descriptions (Geograph) and a corpus of answers to common questions (WikiHow) (cf. Section 2). The documents in the source collection (Window Expeditions) as well as the additional corpora of unstructured text (Geograph and WikiHow) were translated to a vector space. We used HuggingFace’s implementation of sentence-transformers, a machine learning technique for translating unstructured natural language to machine readable representations whilst retaining underlying semantic information [38]. HuggingFace’s sentence-transformers are largely based on BERT [35] and incorporate the model all-mpnet-base-v2, which is based on microsoft/mpnet-base [39] and fine-tuned with 1 billion sentence pairs from a variety of data sources including Flickr [40], Yahoo Answers, WikiAnswers and Reddit [41]. The presented sentence-transformers take a document as input, truncate the document to 384 tokens and return a vector containing 768 signed decimal numbers.

In a further step, we calculated cosine similarity scores between all Window Expeditions document vectors and Geograph as well as WikiHow vectors. Cosine similarity scores are a popular measure of similarity between multi-dimensional vectors and are also used by HuggingFace to calculate document similarity. Since the vectors generated through sentence-transformers encapsulate underlying semantics, higher cosine similarity scores have been found to indicate higher convergence of underlying semantics between the represented texts [cf. 6, 31, 42].
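The pairwise comparison described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the study: the function names (`cosine_similarity`, `similarity_matrix`, `best_match`) and the three-dimensional toy vectors are assumptions made for the example, whereas the real document vectors have 768 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_u * norm_v)

def similarity_matrix(source_vectors, target_vectors):
    """One row per source document, one column per target document."""
    return [[cosine_similarity(s, t) for t in target_vectors]
            for s in source_vectors]

def best_match(row):
    """Index and score of the most similar target document in one row."""
    index = max(range(len(row)), key=row.__getitem__)
    return index, row[index]

# Toy vectors standing in for document embeddings (hypothetical values).
source = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
target = [[2.0, 0.0, 2.0], [0.0, 1.0, 1.0], [-1.0, 0.0, -1.0]]

scores = similarity_matrix(source, target)
top_index, top_score = best_match(scores[0])
```

Because cosine similarity depends only on the angle between vectors, `[1, 0, 1]` and `[2, 0, 2]` score 1.0 despite their different lengths, while vectors pointing in opposite directions score -1.0.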

In order to compare the results of the sentence-transformer approach to a purely lexical baseline, we repeated the steps above using a TF*IDF approach. We first cleaned the raw data by transforming it to lower case, eliminating stopwords and special characters, and performing lemmatisation. We subsequently built document frequency dictionaries and calculated the TF*IDF value of each term in each document. Finally, using terms that are found more than twice in the Window Expeditions corpus (\(N = 796\)), we built document vectors where the indices represent the identified terms and the vector values are the TF*IDF values of the respective terms. This results in a 796-dimensional vector for each document in each of the three corpora (Window Expeditions, Geograph, WikiHow). These vectors were used to calculate cosine similarity scores between all documents of all corpora.
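The baseline vectorisation can be illustrated with a short sketch. Several details are assumptions, since the text does not specify them: term frequency is taken as the relative frequency within a document, inverse document frequency as \(\log(N/\mathrm{df})\), and document frequencies are computed over whichever collection is passed in. The documents are assumed to be already cleaned and lemmatised token lists, and all names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(source_docs, min_count=3):
    """Terms found more than twice (at least 3 times) in the source corpus."""
    counts = Counter(term for doc in source_docs for term in doc)
    return sorted(term for term, n in counts.items() if n >= min_count)

def document_frequencies(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def tfidf_vector(doc, vocabulary, df, n_docs):
    """One TF*IDF value per vocabulary term; 0.0 where the term is absent."""
    tf = Counter(doc)
    vector = []
    for term in vocabulary:
        if tf[term] == 0 or df[term] == 0:
            vector.append(0.0)
        else:
            vector.append((tf[term] / len(doc)) * math.log(n_docs / df[term]))
    return vector

# Toy corpus of pre-cleaned token lists (hypothetical data).
source_docs = [
    ["hill", "tree", "view"],
    ["hill", "road", "tree"],
    ["hill", "snow", "snow"],
    ["tree", "snow", "garden"],
]

vocabulary = build_vocabulary(source_docs)   # terms kept as vector indices
df = document_frequencies(source_docs)
vectors = [tfidf_vector(doc, vocabulary, df, len(source_docs))
           for doc in source_docs]
```

Fixing the vocabulary from the source corpus means every document in every corpus is mapped to a vector of the same length, which is what makes the cosine comparison between corpora possible. It also means that documents sharing none of these terms receive all-zero vectors.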

In a final step we were interested in the topics and themes captured within our actively crowdsourced corpus of landscape descriptions and how these compared to salient topics in the corpora of identified similar documents. We performed Latent Dirichlet Allocation (LDA) topic modelling (number of topics = 10; iterations = 500) to identify clusters of terms belonging to different topics. We identified ten clusters as optimal by comparing coherence scores of the resulting models when specifying one to ten clusters. For each topical cluster we created a multi-dimensional vector representing the probabilities of terms to be contained within respective clusters. By calculating cosine similarity scores between all vectors representing Window Expeditions topic clusters and vectors representing Geograph topic clusters, we created a 10 x 10 matrix of cosine similarity scores and identified topics showing highest similarities. We present and discuss the three most similar topical clusters in further detail and visualise the terms and their probabilities with word clouds.
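The topic comparison step can be sketched as below, assuming each topic cluster is available as a mapping from terms to probabilities. The topic dictionaries and their values are invented for illustration and are not taken from the fitted models. Treating terms that are missing from a topic as having probability 0 is equivalent to aligning both topics on a shared vocabulary before taking the cosine.

```python
import math

def cosine_from_term_weights(p, q):
    """Cosine similarity between two sparse term -> probability mappings."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)

def most_similar_topic_pairs(topics_a, topics_b, top_n=3):
    """Score every topic pair and return the top_n as (score, i, j) tuples."""
    pairs = [
        (cosine_from_term_weights(a, b), i, j)
        for i, a in enumerate(topics_a)
        for j, b in enumerate(topics_b)
    ]
    return sorted(pairs, reverse=True)[:top_n]

# Hypothetical topics; a real run would compare 10 x 10 topic clusters.
topics_source = [{"snow": 0.6, "winter": 0.4}, {"house": 0.5, "garden": 0.5}]
topics_target = [{"hill": 0.7, "field": 0.3}, {"snow": 0.5, "cold": 0.5}]

top_pairs = most_similar_topic_pairs(topics_source, topics_target)
best_score, best_i, best_j = top_pairs[0]
```

With ten topics per corpus the list of pairs has 100 entries, corresponding to the cells of the 10 x 10 matrix, and the first three entries after sorting are the three most similar topical clusters.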

4 Results

4.1 Similarity Judgement and Evaluation

To investigate the effects of using a sentence-transformer based approach as opposed to our baseline method TF*IDF, we compared the cosine similarity score distributions when using sentence-transformers and TF*IDF between Window Expeditions and Geograph as well as Window Expeditions and WikiHow (cf. Figure 2). The results of the sentence-transformer approach show much lower mean cosine similarity scores for WikiHow, demonstrating that this collection did indeed contain less relevant content than Geograph (cf. Table 2). In addition, the baseline TF*IDF approach was found to have lower mean cosine similarity scores for both collections.

Fig. 2 Cosine similarity score distributions comparing all documents from Window Expeditions with Geograph and WikiHow for the proposed sentence-transformer approach (STRAN) as well as the baseline TF*IDF approach (TFIDF). Cosine similarity values in the range of \(0 - 0.01\) were removed from the TF*IDF dataset to reduce bias towards documents showing no shared terms

Table 2 Mean cosine similarity scores using sentence-transformers compared to the baseline method TF*IDF

To evaluate the performance of both the sentence-transformer based approach and the baseline TF*IDF approach, we extracted the Geograph documents showing highest cosine similarity scores for each Window Expeditions contribution for both the sentence-transformer and the baseline TF*IDF approach. From these, we then randomly selected 50 Window Expeditions documents, and designed two simple evaluation tasks. Firstly, for each of the selected Window Expeditions contributions we judged whether document A or B was more similar, where A and B were either the top-ranked Geograph document selected by the sentence-transformer approach or TF*IDF. Secondly, for each document A and B we judged their similarity (relevance) to the original Window Expeditions document on a ternary graded relevance scale. After initial discussions and some refinement of the annotation rules we reached Cohen’s Kappa values of 0.39 for the first comparative test, and 0.32 for the second (after reducing the scale to a binary one, since very few documents were judged to be very similar). These Cohen’s Kappa values are typically interpreted as being fair. Given the complexity of our task – judging that documents discussed similar landscape settings, themes or conditions – we deemed these values to be sufficient, and one author then annotated a further 300 documents in the same way.
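For readers unfamiliar with the agreement statistic, Cohen's Kappa corrects the observed agreement between two annotators for the agreement expected by chance: \(\kappa = (p_o - p_e) / (1 - p_e)\). The sketch below is illustrative only; the two label lists are made up and do not reproduce the annotations of this study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical binary judgements (1 = similar, 0 = not similar) for ten items.
annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 8 of 10 items, but half of that agreement would be expected by chance, so the resulting Kappa is 0.6 rather than 0.8.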

The results of our evaluation showed that, of 350 comparisons, the most similar document returned by the sentence-transformer approach was judged to be a better match than that returned by the baseline TF*IDF approach in 153 cases (44%). In 19 cases (5%), the most similar document found by the TF*IDF approach was a better match, while in 158 cases (45%) both were considered equally similar and in 20 cases (6%) neither was a good match. These results demonstrate that, over the collection as a whole, the sentence-transformer based approach performed better than our simple baseline.

In Figure 3 we show the relationship between cosine similarity and our ternary relevance judgements using box-plots. For documents identified using sentence-transformers, a Welch Two Sample t-test shows a clear and statistically significant relationship between the similarity judgements and the distribution of cosine similarity values (similarity judgements 1 vs. 2: \(t = -4.1961\), \(df = 27.352\), \(p < 0.01\); similarity judgements 1 vs. 3: \(t = -5.1503\), \(df = 30.521\), \(p < 0.01\); similarity judgements 2 vs. 3: \(t = -2.3223\), \(df = 284.89\), \(p = 0.0209\)). By contrast, when comparing similarity judgements to cosine similarity values generated using TF*IDF we find no significant relationships (similarity judgements 1 vs. 2: \(t = -1.9499\), \(df = 233\), \(p = 0.0524\); similarity judgements 1 vs. 3: \(t = -0.2862\), \(df = 106.75\), \(p = 0.7753\); similarity judgements 2 vs. 3: \(t = -1.8486\), \(df = 124.73\), \(p = 0.0669\)). These results demonstrate not only that our sentence-transformer based method performs better in identifying documents judged to be similar, but also that top-ranked documents identified by this method are more likely to be relevant if they have higher cosine similarities. Based on an inspection of our box plots, we suggest that a cosine similarity threshold of 0.7 is an appropriate value above which documents are more likely to be judged similar. Since our ternary relevance judgements are for top-ranked documents, we can also use them to calculate precision at rank 1 (P@1). For sentence-transformers P@1 was 0.93, for TF*IDF 0.69.
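The test statistic and the fractional degrees of freedom reported above follow from Welch's formulation, which does not assume equal variances in the two groups. The sketch below computes both quantities with the Python standard library. The two samples are invented cosine similarity values, not the study data, and the p-value is omitted because it requires the t-distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors from the unbiased sample variances.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof

# Hypothetical cosine similarities grouped by human judgement.
judged_not_similar = [0.50, 0.55, 0.60, 0.65]
judged_similar = [0.70, 0.75, 0.80, 0.85, 0.90]

t_stat, dof = welch_t(judged_not_similar, judged_similar)
```

A negative statistic, as in the reported results, indicates that the group listed first (the lower similarity judgement) has the lower mean cosine similarity. The non-integer degrees of freedom are a direct consequence of the Welch-Satterthwaite approximation.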

Fig. 3 Boxplots showing cosine similarity scores between Window Expeditions contributions and most similar Geograph descriptions identified through sentence-transformers (left) and TF*IDF (right) and respective human ratings of similarity. 1 not similar, 2 somewhat similar, 3 highly similar. ***: Statistically significant difference at significance level 0.01

4.2 Exploring most Similar Documents

After investigating how our proposed approach of using sentence-transformers compares with a simple baseline, we explored the qualities of documents with high cosine similarity scores as calculated using sentence-transformers in more detail. In particular, we were interested in the number of documents found to have similarity scores of 0.7 and above, since this was suggested by Figure 3 to be an appropriate value with which to identify similar documents. The results show that the proposed workflow could extract many more documents from Geograph than from WikiHow, both in terms of absolute numbers and percentages (cf. Table 3). This further strengthens the argument that sentence-transformers and a small high-quality natural language landscape specific dataset can be used to identify rich landscape relevant documents in other collections.

Table 3 Identified similar documents in each corpus

An initial qualitative inspection of the most similar Geograph and WikiHow documents (according to the performed cosine similarity calculations with sentence-transformer vectorisation) showed the identified texts did indeed capture landscape relevant semantics for both Geograph and WikiHow. However, the style of writing is notably different with Geograph containing descriptive natural language and WikiHow containing explanatory natural language, which is to be expected when considering the domain and genre of respective corpora, with Geograph texts describing locations, while WikiHow provides answers to specific questions.

Examples of Identified Similar Geograph documents:

  1. When it come to autumn colour, beeches are the most colourful of our native species, retaining their leaves longer than most other species, and displaying a range of vivid gold, orange and russet until well into November. (cosine similarity = 0.85) (By Anne Burgess)

  2. Snow, which I believe fell for about 15 hours the previous day and night, has been removed from the tops of the trees by strong winds, but it has adhered to the eastern side of trunks and branches, and lies thick on the ground. It’s been excellently scrunchy snow for snowmen and snowballs. This view is from where [5715067] was taken. (cosine similarity = 0.84) (By Derek Harper)

  3. Beautiful sand, interesting surf, good rocky outcrops make this an excellent beach. With little wind there is still surf, with some westerlies this beach gets exciting. (cosine similarity = 0.83) (By Peter Church)

Examples of Identified Similar WikiHow Documents:

  1. The sound of a car without a muffler chugging down the street is never welcome, so make sure you’re not the one causing noise pollution in your neighborhood and spring to get your car fixed. Keeping your car in good, quiet working order will be appreciated by everyone who lives near you. The same goes for your lawnmower and any other noisy equipment you might use outdoors. To have an even greater impact on noise in your area, consider walking or biking instead of using a car whenever possible. (cosine similarity = 0.74)

  2. As urban development stretches into rural areas, the noise level increases. Construction sites, airports, train stations, and highways are all sources of loud noises that grate on the ears. If you know the sources of noise pollution in your area, you can do your best to avoid them or find ways to mitigate their negative effect. When you’re choosing a place to live, see if the residence is in a flight path or near a busy highway. During the day the sounds might not bother you, but at night they might prevent you from sleeping. (cosine similarity = 0.74)

  3. Being in the sunshine and fresh air has proven health benefits, from easing depression to improving your outlook on life. Go for a walk, take some photographs, or simply sit on your porch to enjoy the benefits of fresh air. If you live somewhere too cold to go outside, consider getting a sunlamp to compensate for the lack of daylight. (cosine similarity = 0.70)

The results show that Geograph representative landscape image descriptions can be very similar to the contributions to Window Expeditions, addressing very similar themes about what people perceive in landscapes, and sometimes written in similar styles. The identified WikiHow documents, despite the different style of writing, address similar themes to some of the Window Expeditions contributions, such as soundscapes and noise (a frequent topic found in in-situ landscape descriptions [7, 10]) or the general benefits of being outdoors. This initial qualitative inspection of the identified documents shows that the proposed workflow identifies documents rich in landscape information.

To further investigate the similarities between documents, we calculated the number of identified similar documents for each individual document in Window Expeditions (cf. Figure 4). The results show that many documents have only a small number of identified similar documents, while a small number of documents have many. This suggests that a small number of documents (visible as steps of high increase in cumulative sum in Figure 4) are particularly important in identifying similar documents in the other collection, hinting at particularly salient topics. In the following, examples of such documents are shown with the respective number of identified similar documents in the other collection.

Window Expeditions Documents with the Highest Number of Similar Geograph Documents

  1. A view over open fields with boundaries of hedges and trees. Rising hills out of the low valley. (number of similar Geograph documents = 1895)

  2. There are two roads, both full of cars. There is the church across the street, and power lines obscuring my view. I can see some various road signs, and a lot of tress, If I really focus I can see a parking lot for UNCC - South Deck I believe. (number of similar Geograph documents = 621)

  3. I can see a small fragment of car park with streetlights. Beyond that is an area of grassland with a few scrubby trees. Then in the distance I can see a hillside with two wind turbines on the top of the hill. The land cover on this hillside is mainly agricultural land with a few scattered trees and linear hedgerows. (number of similar Geograph documents = 412)

Fig. 4 Cumulative sum of similar Geograph documents identified using the sentence-transformer based approach with cosine similarity greater than 0.7
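The counts behind such a cumulative curve can be derived from a matrix of cosine similarity scores as sketched below. The score matrix is a made-up example with three source documents and four target documents. Two details are assumptions: the comparison is inclusive (0.7 and above), and the source documents are ordered by ascending count before accumulating, since the ordering used in the figure is not stated.

```python
from itertools import accumulate

THRESHOLD = 0.7  # value suggested in the text from the box plots

def similar_counts(score_matrix, threshold=THRESHOLD):
    """Per source document, count target documents at or above the threshold."""
    return [sum(score >= threshold for score in row) for row in score_matrix]

def cumulative_counts(counts):
    """Running total over source documents, ordered by ascending count."""
    return list(accumulate(sorted(counts)))

# Hypothetical scores: rows are source documents, columns target documents.
scores = [
    [0.82, 0.71, 0.40, 0.75],
    [0.10, 0.69, 0.30, 0.20],
    [0.70, 0.20, 0.10, 0.05],
]

counts = similar_counts(scores)      # similar targets per source document
running = cumulative_counts(counts)  # cumulative sum over sorted counts
```

In such a curve, a source document with many similar targets appears as a large step, which is how the particularly influential Window Expeditions documents become visible.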

Window Expeditions documents with the highest number of similar Geograph image descriptions often contain terms such as “view”, “hill”, “tree” and “car”, which are shared by some or all of these documents. This suggests that the Geograph collection hosts many image descriptions revolving around what contributors see, the topography of a given area and mentions of transportation and vehicles.

To further investigate the similarity between Window Expeditions and the identified similar documents, we performed latent Dirichlet allocation (LDA) topic modelling. We only applied LDA topic modelling to the identified similar Geograph documents given the very small number of similar WikiHow texts found. After comparing all topics in the two corpora using cosine similarity scores (Figure 5), we identified three topical clusters showing the highest similarities between Window Expeditions and Geograph and explore these further.

Fig. 5 Cosine similarity scores between all LDA topic clusters of Window Expeditions and Geograph

Fig. 6 Wordclouds of clusters representing the topic Snow and Weather in landscapes

The topic showing the highest similarity between Window Expeditions and Geograph (cosine similarity score = 0.67) revolves around the general theme of snow and weather related phenomena. Particularly salient are the terms associated with winter and snow (“snow”, “winter”, “cold” and “cover”) as well as terms referring to other weather related phenomena, including “sun” and “rain” (Figure 6). This suggests that weather, and in particular snow, are important topics captured both within the Window Expeditions corpus and within the identified similar Geograph documents.

Fig. 7 Wordclouds of clusters representing the topic Rural and Natural landscapes

A further salient topic found in both Window Expeditions and Geograph (cosine similarity score = 0.63) and identified through LDA topic modelling is the general theme of the countryside and rural areas (Figure 7). Prominent shared terms between the clusters include “hill” as well as “field” and additional terms relating to rural areas such as “horse”, “cattle”, “pasture” and “farmland” are found in one or the other cluster. This suggests that rural and more natural areas are important topics in both the original Window Expeditions as well as the newly generated corpora.

Fig. 8 Wordclouds of clusters representing the topic Urban and Residential landscapes

Finally, we find the topic of urban and residential areas salient in both Window Expeditions and Geograph (cosine similarity score = 0.66). Frequent terms reflect common elements found in everyday lived landscapes and include “house”, “garden”, “home”, “road”, and “tree” (Figure 8). Together, these topics suggest that both corpora capture natural and rural as well as urban and residential landscapes, and highlight the fact that both corpora capture similar semantics regarding landscapes.

5 Discussion

The results of generating a large landscape relevant corpus of natural language using a small curated high quality dataset point towards a number of interesting observations which we discuss in more detail below. Specifically we discuss the properties of the generated corpora and we explore the topics that emerge from the data. Further, we present the limitations of this study and potential avenues of future work.

Using a high quality domain specific dataset (in this case actively crowdsourced in-situ natural language landscape descriptions) as a basis for identifying similar documents in other corpora through sentence-transformers was found to successfully generate new domain-specific corpora. Sentence-transformers encapsulate a document’s meaning into a representative machine readable vector [35]; however, questions of model training and nuances in language remain.

The proposed approach using sentence-transformers was found to identify more similar documents, as judged by human annotators, compared to the baseline approach of using TF*IDF. The better performance of sentence-transformers somewhat contradicts findings of similar studies where TF*IDF based vectorisation approaches have been found to outperform more recent word embedding approaches [cf. 45, 46, 47]. A potential explanation is the domain specificity of the initial Window Expeditions dataset and the target Geograph corpus. Since both are landscape relevant and thus the language is domain specific, sentence-transformer based vectorisation appears to encapsulate more underlying semantic information into the resulting vectors than the baseline TF*IDF approach. This is to be expected, since TF*IDF is based on a bag-of-words approach whereas sentence-transformers use large pre-trained language models.

The past three decades have seen a shift from top-down expert based landscape perception research to bottom-up approaches of landscape characterisations [48,49,50]. In line with contemporary frameworks, modern approaches incorporate participatory data generation efforts. Including the views, values and perceptions of a heterogeneous group of individuals in landscape perception and preference research is crucial to understanding various human-landscape interactions and has seen increased attention [cf. 7, 21, 51]. However, participatory data generation efforts can be time consuming and costly, limiting spatial coverage of the underlying data.

With increasing global internet access, an abundance of user generated content has been created and stored. These vast amounts of unstructured natural language potentially contain many documents highly relevant for certain scientific inquiries, in this case landscape perception. However, identifying potentially relevant documents in large text collections remains challenging. Landscape relevant documents have been extracted from large corpora using rule based filtering techniques [10]; however, using a small high quality corpus in combination with sentence-transformers has not been attempted. The results of this study show that the proposed workflow is indeed able to identify similar documents in corpora of the same domain (Geograph) as well as in large web corpora of a different domain (WikiHow). This strengthens the argument that sentence-transformers do indeed capture underlying semantics, even in specific domains, and that the proposed workflow can be used to create large corpora of landscape relevant data. Going beyond landscape perception research, the proposed workflow could potentially be used in a variety of projects.

To further investigate the topics captured within the identified documents, we performed Latent Dirichlet Allocation topic modelling and compared the resulting clusters. The results show that both the Window Expeditions and the Geograph corpora capture similar themes as reported by participants. The identified topics revolve around snow and weather, rural and natural landscapes, and urban and residential areas. Interestingly, particularly salient landscape dimensions identified in the literature are mountains, water bodies and recreational areas [cf. 52, 53]. The generated topic revolving around rural and natural landscapes captures these concepts; however, we also identified urban and residential areas to be particularly important, calling for further research on how the more immediate surroundings of participants, their everyday lived landscapes, are perceived. The results of the LDA topic modelling underline the potential of the proposed workflow, since both the Window Expeditions and the identified similar Geograph documents capture similar topics and landscape dimensions.

5.1 Limitations

Using a small curated high quality collection of natural language texts, the proposed workflow is able to successfully identify similar documents in large corpora as a means of creating large domain and genre specific datasets. However, the approach is accompanied by three key limitations.

Firstly, the literature agrees that language and culture are intertwined at a fundamental level [23, 54]. However, many studies on landscapes and on cultural differences in general are conducted in and written for an English speaking population. Since we tested the proposed workflow only on English texts, and given the limited availability of pre-trained sentence-transformer models in other languages, the proposed workflow may only achieve high quality results in English. The lack of multi-lingual investigations into landscapes calls for more cross-cultural and cross-linguistic explorations of language and landscape. Furthermore, since our approach uses user generated content as a basis, it is subject to the well known limitations of this form (e.g. participation inequality and potential biases in the demographics of contributors) [55]. Our texts may thus not be representative of all ways in which landscapes are perceived, and are likely to be biased towards popular locations [56].

Secondly, the methodological approach leads to a high dimensionality of TF*IDF and sentence-transformer derived vectors. Since the TF*IDF vectors are sparse with most values being 0 (term not present in the respective Window Expeditions and Geograph documents), the resulting cosine similarity calculations for TF*IDF are heavily biased towards 0. Using dimensionality reduction techniques such as t-SNE [57] could potentially reduce this bias, leading to better baseline results. In addition, Window Expeditions is not geographically limited in scope, and we refrain from using spatially explicit criteria for our similar document identification workflow. One approach might be to use geographically trained language models [cf. 58] as an additional criterion for judging similarity. Furthermore, truncating documents to 384 tokens for calculation of the sentence-transformer derived vectors may also result in some loss of information.

A final major limitation of the proposed workflow to generate large domain specific natural language corpora is the black-box nature of many widely adopted machine learning NLP approaches. The resulting vectors have no human significance and it is thus impossible to retrace the path of initial individual documents to the results. Slight changes to the underlying sentence-transformer model could eliminate any reproducibility. In addition, the underlying training data used to pre-train the sentence-transformers used within this study are mostly large web corpora and thus might not capture domain specific nuances. We thus propose future research should focus on building domain specific sentence-transformer models. In the case of landscape research, we could use corpora generated through the proposed workflow to train a new sentence-transformer model with landscape relevant natural language (e.g. LANDBert).