Identifying Landscape Relevant Natural Language using Actively Crowdsourced Landscape Descriptions and Sentence-Transformers

Natural language has proven to be a valuable source of data for various scientific inquiries, including landscape perception and preference research. However, large high quality landscape relevant corpora are scarce. Here we propose and discuss a natural language processing workflow to identify landscape relevant documents in large collections of unstructured text. Using a small curated high quality collection of actively crowdsourced landscape descriptions, we identify and extract similar documents from two different corpora (Geograph and WikiHow) using sentence-transformers and cosine similarity scores. We show that 1) sentence-transformers combined with cosine similarity calculations successfully identify similar documents in both Geograph and WikiHow, effectively opening the door to the creation of new landscape specific corpora, 2) the proposed sentence-transformer approach outperforms traditional Term Frequency-Inverse Document Frequency based approaches and 3) the identified documents capture similar topics when compared to the original high quality collection. The presented workflow is transferable to various scientific disciplines in need of domain specific natural language corpora as underlying data.


Introduction and Background
In geography and the spatial humanities, research using unstructured natural language texts as a starting point for detailed inquiries has received increasing attention. Natural language texts have been identified as rich data sources in numerous spatial disciplines, ranging from disaster management [1,2] and epidemiology [3,4] to landscape perception research [5][6][7], and have been found to frequently contain implicit or explicit spatial information. Commonly, corpora of natural language serve as underlying datasets, from which relevant information is extracted using Natural Language Processing (NLP) and human annotation. An important limitation with respect to such research is the availability of corpora relevant to specific research questions, also referred to as domain specific corpora. In contemporary research, domain specific corpora drawing on different underlying data sources have supported various natural language based studies, ranging from natural language generation systems [16] and the computational identification of irony and sarcasm [17] to term disambiguation [18]. However, approaches of building corpora by extracting relevant documents from existing text sources remain scarce in the domain of landscape perception and preference research specifically, and with respect to answering geographic questions more generally, limiting the possibilities for natural language based inquiries into how landscapes are perceived and valued.
How we perceive landscapes has been found to influence our well-being and guide our behaviour [19,20]. As such, compiling landscape specific corpora of natural language could further our understandings of how we perceive and interact with our surroundings, potentially guiding future policies and decisions. As with corpus based research in general, one of the most limiting factors in landscape perception and preference research is the availability of high quality datasets about our surroundings. Although sensor based datasets (e.g. temperature, humidity, spectral reflectance) are relatively easy to collect and are thus plentiful, experiential and perceptual datasets describing more qualitatively how individuals interact with and value landscapes remain scarce and are more difficult to collect [cf. 21].
Studies indicate that the connection between natural language and landscape perception runs deep [22,23], making natural language a valuable source of perceptual and experiential information about our environment [5,6,23,24]. Analysing natural language has become commonplace in landscape perception and preference studies, ranging from identifying salient landscape characteristics through crowdsourced image descriptions in the Lake District [5], identifying the most prominent environmental features using alpine year books [12] and extracting important landscape dimensions from short stories [24], to exploring fictive motion in landscapes in hiking blogs [25]. Many of these approaches rely on making unstructured text machine readable, either through qualitative coding and human annotation or by means of natural language processing [26,27]. However, they rely on having corpora available which contain relevant documents; the focus of this paper lies on effectively identifying such documents in large corpora.
The internet hosts an abundance of unstructured natural language, potentially containing landscape relevant information [cf. 21]. However, non-specific corpora commonly contain many irrelevant documents which need to be identified and removed. In order to ensure the landscape relevance of the underlying natural language data, moderating content achieves the best results. However, moderation is only feasible in rather small collections of natural language due to the time needed to assess each document [28]. Moderating participatory data generation is thus limited to efforts resulting in small high quality datasets. The question thus arises: how can equally relevant documents in large collections of unstructured natural language be computationally identified? This requires that we numerically capture notions of document content, and one effective approach to approximating these semantics is vector representations [29,30], which also allow for computational comparisons using cosine similarity scores [6,29,31]. Cosine similarity is a frequently used metric of similarity between multidimensional vectors and has been applied to various natural language processing tasks, ranging from clustering biomedical articles [32] to gauging landscape perception in small corpora [6]. Text vectorisation can take many forms, from simple binary vectors recording the presence or absence of terms, through Term Frequency-Inverse Document Frequency (TF*IDF) weightings, to more recent approaches such as Word2Vec [33] and GloVe [34] which do not rely solely on term matching. A further recent addition to vectorisation methods are Bidirectional Encoder Representations from Transformers (BERT), which mark a significant improvement over previous approaches [35]. However, using BERT to translate a high quality actively crowdsourced landscape relevant corpus into machine readable vectors and identifying similar documents in large web corpora through cosine similarity calculations has not been attempted.
Having such a workflow would allow for the creation of larger datasets relevant for landscape perception and preference research.
The mentioned limitations, in combination with contemporary natural language processing and computational linguistics methods, have led to the formulation of the following research questions:
• RQ1: How can we use a small collection of high quality natural language landscape descriptions to identify relevant documents in large collections of text through natural language processing?
• RQ2: How do the identified documents compare in terms of quality and landscape relevance?

Data
To collect an initial high quality dataset, we developed and implemented an active crowdsourcing platform named Window Expeditions [cf. 36]. The platform allows interested individuals to upload representative in-situ natural language descriptions of their everyday lived landscapes in three languages: English, German and French. The contributed landscape descriptions are moderated to ensure a resulting high quality dataset. Within this study, we use the first 427 English natural language landscape descriptions contributed to Window Expeditions as our initial dataset.
Since we are interested in identifying similar documents in large corpora of unstructured text, especially in web content, we included two very different additional text collections (cf. Table 1). On the one hand we chose Geograph [cf. 10], an actively crowdsourced collection containing millions of representative landscape images and descriptions. On the other hand, we include a corpus from a completely different domain, WikiHow [cf. 37] containing common questions and answers to a variety of topics. We assumed that Geograph was much more likely to contain domain relevant descriptions to our task and included WikiHow as a control containing relatively short, often informally written conversational texts.

Methods
In order to identify landscape relevant natural language in large collections of unstructured text, we propose a workflow based on methods of annotation, natural language processing and sentence-transformers. All methods are combined into a semi-automated and scalable workflow (Figure 1), the specifics of which are presented in the following.
Our starting point was a small collection of curated, actively crowdsourced natural language landscape descriptions from the active crowdsourcing platform Window Expeditions, as well as a corpus of landscape image descriptions (Geograph) and a corpus of answers to common questions (WikiHow) (cf. Section 2). The documents in the source collection (Window Expeditions) as well as the additional corpora of unstructured text (Geograph and WikiHow) were translated to a vector space. We used HuggingFace's implementation of sentence-transformers, a machine learning technique for translating unstructured natural language to machine readable representations whilst retaining underlying semantic information [38]. HuggingFace's sentence-transformers are largely based on BERT [35] and incorporate the model all-mpnet-base-v2, which is based on microsoft/mpnet-base [39] and fine-tuned with 1 billion sentence pairs from a variety of data sources including Flickr [40], Yahoo Answers, WikiAnswers and Reddit [41]. The presented sentence-transformers take a document as input, truncate it to 384 tokens and return a vector containing 768 signed decimal numbers.
In a further step, we calculated cosine similarity scores between all Window Expeditions document vectors and Geograph as well as WikiHow vectors. Cosine similarity scores are a popular measure of similarity between multi-dimensional vectors and are also used by HuggingFace to calculate document similarity. Since the vectors generated through sentence-transformers encapsulate underlying semantics, higher cosine similarity scores have been found to indicate higher convergence of underlying semantics between the represented texts [cf. 6,31,42].
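The similarity computation described above can be sketched as follows. This is a minimal illustration assuming the document vectors have already been produced (e.g. by a sentence-transformer's encode step); the 4-dimensional toy vectors stand in for the actual 768-dimensional embeddings.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Illustrative 4-dimensional "document vectors" (real ones have 768 dimensions).
we_vectors = np.array([[1.0, 0.0, 1.0, 0.0],
                       [0.0, 1.0, 0.0, 1.0]])       # Window Expeditions docs
geograph_vectors = np.array([[1.0, 0.0, 1.0, 0.0],
                             [1.0, 1.0, 0.0, 0.0]])  # Geograph docs

# sims[i, j] is the similarity between Window Expeditions doc i and Geograph doc j.
sims = cosine_similarity_matrix(we_vectors, geograph_vectors)
```

For each Window Expeditions document, the top-ranked candidate is then simply the column with the largest value in its row of `sims`.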
In order to compare the results of the sentence-transformer approach to a purely lexical baseline, we repeated the steps above using a TF*IDF approach. We first cleaned the raw data by transforming it to lower case, eliminating stopwords and special characters, and performing lemmatisation. We subsequently built document frequency dictionaries and calculated the TF*IDF value of each term in each document. Finally, using the terms found more than twice in the Window Expeditions corpus (N = 796), we built document vectors where the indices represent the identified terms and the values are the TF*IDF scores of the respective terms. This results in a 796-dimensional vector for each document in each of the three corpora (Window Expeditions, Geograph, WikiHow). These were used to calculate cosine similarity scores between all documents of all corpora.
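The TF*IDF baseline can be sketched as below. The toy documents and the idf variant (log of corpus size over document frequency) are illustrative assumptions, not the exact implementation details of the study.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Build one TF*IDF vector per document over a fixed vocabulary.

    TF is the raw term count in the document; IDF is log(N / df(t)),
    where df(t) is the number of documents containing term t.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc) if term in vocab)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) if df.get(t) else 0.0
                        for t in vocab])
    return vectors

# Toy pre-processed (lower-cased, lemmatised, stopword-free) documents.
docs = [["snow", "cover", "hill"],
        ["hill", "field", "cattle"],
        ["house", "garden", "road"]]
vocab = sorted({t for d in docs for t in d})
vecs = tfidf_vectors(docs, vocab)
```

Because most vocabulary terms do not occur in any given document, these vectors are sparse, which is relevant for the bias discussed in the limitations section.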
In a final step we were interested in the topics and themes captured within our actively crowdsourced corpus of landscape descriptions and how these compared to salient topics in the corpora of identified similar documents. We performed Latent Dirichlet Allocation (LDA) topic modelling (number of topics = 10; iterations = 500) to identify clusters of terms belonging to different topics. We identified ten clusters as optimal by comparing coherence scores of the resulting models when specifying one to ten clusters. For each topical cluster we created a multi-dimensional vector representing the probabilities of terms being contained within the respective cluster. By calculating cosine similarity scores between these topic vectors, we compared the salient topics of the original corpus with those of the identified similar documents.
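The topic-modelling step can be sketched with scikit-learn's LDA implementation (our choice here for illustration; any LDA implementation works). The toy corpus, topic count and iteration budget are placeholders for the actual data and settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for landscape descriptions.
corpus = [
    "snow covers the hills in winter",
    "cold winter rain and snow",
    "cattle graze on the rural pasture fields",
    "houses and gardens line the residential road",
]

# Bag-of-words counts, then LDA with a fixed seed for reproducibility.
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row of lda.components_ is an unnormalised term distribution per topic;
# normalising yields the topic vectors that are compared via cosine similarity.
topic_term = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

Comparing topics across two corpora then amounts to computing cosine similarities between the rows of their respective `topic_term` matrices.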

Similarity Judgement and Evaluation
To investigate the effects of using a sentence-transformer based approach as opposed to our baseline method TF*IDF, we compared the cosine similarity score distributions of both methods between Window Expeditions and Geograph as well as between Window Expeditions and WikiHow (cf. Figure 2). The results of the sentence-transformer approach show much lower mean cosine similarity scores for WikiHow, demonstrating that this collection did indeed contain less relevant content than Geograph (cf. Table 2). In addition, the baseline TF*IDF approach was found to produce lower mean cosine similarity scores for both collections.
To evaluate the performance of both the sentence-transformer based approach and the baseline TF*IDF approach, we extracted the Geograph documents showing the highest cosine similarity scores for each Window Expeditions contribution under both approaches. From these, we randomly selected 50 Window Expeditions documents and designed two simple evaluation tasks. Firstly, for each of the selected Window Expeditions contributions we judged whether document A or B was more similar, where A and B were the top-ranked Geograph documents selected by the sentence-transformer approach and by TF*IDF respectively. Secondly, for each document A and B we judged their similarity (relevance) to the original Window Expeditions document on a ternary graded relevance scale. After initial discussions and some refinement of the annotation rules, we reached Cohen's Kappa values of 0.39 for the first, comparative test and 0.32 for the second (after reducing the scale to a binary one, since very few documents were judged to be very similar). Such Cohen's Kappa values are typically interpreted as fair agreement. Given the complexity of our task, judging whether documents discussed similar landscape settings, themes or conditions, we deemed these values sufficient, and one author then annotated a further 300 documents in the same way.
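Inter-annotator agreement as used above can be computed with a few lines; the two judgement lists below are invented for illustration and do not reproduce the study's annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical binary relevance judgements from two annotators.
a = ["sim", "sim", "not", "not", "sim", "not", "sim", "not"]
b = ["sim", "not", "not", "not", "sim", "sim", "sim", "not"]
kappa = cohens_kappa(a, b)
```

On this toy example, six of eight items agree (p_o = 0.75) against a chance agreement of 0.5, yielding a kappa of 0.5, i.e. moderate agreement.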
The results of our evaluation firstly showed that, across 350 comparisons, in 153 cases the most similar document returned was judged a better match when using the sentence-transformer approach than when using the baseline TF*IDF approach. In 19 cases the most similar document found by the TF*IDF approach was a better match, while in 158 cases both were considered equally similar and in 20 cases neither was a good match. These results clearly demonstrate that, over the collection as a whole, the sentence-transformer based approach performed better than our simple baseline.
In Figure 3 we show the relationship between cosine similarity and our ternary relevance judgements using boxplots. There is a clear and statistically significant relationship between documents judged as non-similar/similar and the distribution of cosine similarity values for documents identified using sentence-transformers, as confirmed by a Welch two-sample t-test. These results demonstrate that not only does our sentence-transformer based method perform better in identifying documents judged to be similar, but also that top-ranked documents identified by this method are more likely to be relevant if they have higher cosine similarities. By inspecting the boxplots we suggest that a cosine similarity threshold of 0.7 is an appropriate value above which documents are more likely to be judged similar. Since our ternary relevance judgements are for top-ranked documents, we can also use them to calculate P@1: 0.93 for sentence-transformers and 0.69 for TF*IDF.
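Given graded relevance judgements for the top-ranked document of each query, P@1 reduces to the fraction of queries judged at least somewhat similar. The judgement list below is illustrative, not the study's actual annotations.

```python
def precision_at_1(judgements, relevant_grades=frozenset({2, 3})):
    """P@1: share of queries whose top-ranked document is judged relevant.

    Grades follow the ternary scale: 1 not similar, 2 somewhat similar,
    3 highly similar; grades 2 and 3 count as relevant.
    """
    return sum(g in relevant_grades for g in judgements) / len(judgements)

# Hypothetical top-1 judgements for ten queries.
st_judgements = [3, 2, 2, 3, 2, 2, 3, 2, 2, 1]
p_at_1 = precision_at_1(st_judgements)
```

Here nine of ten top-ranked documents are judged relevant, giving P@1 = 0.9.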

Exploring most Similar Documents
After investigating how our proposed approach of using sentence-transformers compares with a simple baseline, we explored in more detail the qualities of documents with high cosine similarity scores as calculated using sentence-transformers. In particular, we were interested in the number of documents with similarity scores of 0.7 and above, since Figure 3 suggested this to be an appropriate value with which to identify similar documents. The results show that the proposed workflow could extract many more documents, both in absolute numbers and in percentages, from Geograph than from WikiHow (cf. Table 3). This further strengthens the argument that sentence-transformers and a small high quality landscape specific natural language dataset can be used to identify rich landscape relevant documents in other collections. An initial qualitative inspection of the most similar Geograph and WikiHow documents (according to the performed cosine similarity calculations with sentence-transformer vectorisation) showed that the identified texts did indeed capture landscape relevant semantics in both cases. However, the style of writing is notably different, with Geograph containing descriptive and WikiHow explanatory natural language. This is to be expected considering the domain and genre of the respective corpora: Geograph texts describe locations, while WikiHow provides answers to specific questions.

Examples of Identified Similar WikiHow Documents:
1. The sound of a car without a muffler chugging down the street is never welcome, so make sure you're not the one causing noise pollution in your neighborhood and spring to get your car fixed. Keeping your car in good, quiet working order will be appreciated by everyone who lives near you. The same goes for your lawnmower and any other noisy equipment you might use outdoors. To have an even greater impact on noise in your area, consider walking or biking instead of using a car whenever possible. (cosine similarity = 0.74)
2. As urban development stretches into rural areas, the noise level increases. Construction sites, airports, train stations, and highways are all sources of loud noises that grate on the ears. If you know the sources of noise pollution in your area, you can do your best to avoid them or find ways to mitigate their negative effect. When you're choosing a place to live, see if the residence is in a flight path or near a busy highway. During the day the sounds […]
Fig. 3 Boxplots of cosine similarity scores and respective human ratings of similarity. 1 not similar, 2 somewhat similar, 3 highly similar. ***: statistically significant difference at significance level 0.01
[…] (cf. [7,10]) or the general benefits of being outdoors. This initial qualitative inspection of the identified documents shows that the proposed workflow identifies documents rich in landscape information.
To further investigate the similarities between documents, we calculated the number of identified similar documents for each individual document in Window Expeditions (cf. Figure 4). The results show that many documents have a small number of identified similar documents, while a small number have many. This suggests that a small number of documents (visible as steps of high increase in the cumulative sum in Figure 4) account for a large share of the identified similar documents. Window Expeditions documents with the highest number of similar Geograph image descriptions often share terms such as "view", "hill", "tree" and "car" between one or all documents. This suggests that the Geograph collection hosts many image descriptions revolving around what contributors see, the topography of a given area, and mentions of transportation and vehicles. To further investigate the similarity between Window Expeditions and the identified similar documents, we performed LDA topic modelling. We only applied it to the identified similar Geograph documents, given the very small number of similar WikiHow texts found. After comparing all topics in the two corpora using cosine similarity scores (Figure 5), we identified the three topical clusters showing the highest similarities between Window Expeditions and Geograph and explore these further.
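The per-document counts and the cumulative curve shown in Figure 4 can be derived directly from the similarity matrix; the matrix and threshold below are illustrative.

```python
import numpy as np

# Illustrative similarity matrix: rows = Window Expeditions docs,
# columns = Geograph docs (values stand in for sentence-transformer scores).
sims = np.array([[0.82, 0.75, 0.40],
                 [0.71, 0.30, 0.20],
                 [0.10, 0.25, 0.15]])

threshold = 0.7
per_doc = (sims >= threshold).sum(axis=1)   # similar Geograph docs per WE doc
order = np.sort(per_doc)[::-1]              # most productive documents first
cumulative = np.cumsum(order)               # the curve plotted in Figure 4
```

Steep steps at the start of `cumulative` correspond to the few source documents that contribute many similar documents; the long flat tail corresponds to documents contributing few or none.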
The topic showing highest similarity between Window Expeditions and Geograph (cosine similarity score = 0.67) revolves around the general theme of snow and weather related phenomena. Particularly salient are the terms associated with winter and snow "snow", "winter", "cold" and "cover" as well as terms referring to other weather related phenomena including "sun" and "rain" (Figure 6). This suggests that weather and in particular snow are important topics captured both within the Window Expeditions corpus as well as within the identified similar Geograph documents.
A further salient topic found in both Window Expeditions and Geograph (cosine similarity score = 0.63) and identified through LDA topic modelling is the general theme of the countryside and rural areas (Figure 7). Prominent shared terms between the clusters include "hill" as well as "field" and additional terms relating to rural areas such as "horse", "cattle", "pasture" and "farmland" are found in one or the other cluster. This suggests that rural and more natural areas are important topics in both the original Window Expeditions as well as the newly generated corpora.
Finally, we find the topic of urban and residential areas salient in both Window Expeditions as well as Geograph (cosine similarity score = 0.66). Frequent terms reflect common elements found in everyday lived landscapes and include "house", "garden", "home", "road", and "tree" (Figure 8). These suggest that both corpora capture natural and rural as well as urban and residential landscapes and highlights the fact that both corpora capture similar semantics regarding landscapes.

Discussion
The results of generating a large landscape relevant corpus of natural language using a small curated high quality dataset point towards a number of interesting observations which we discuss in more detail below. Specifically we discuss the properties of the generated corpora and we explore the topics that emerge from the data. Further, we present the limitations of this study and potential avenues of future work.
Using a high quality domain specific dataset, in this case actively crowdsourced in-situ natural language landscape descriptions, as a basis for identifying similar documents in other corpora through sentence-transformers was found to successfully generate new domain specific corpora. Sentence-transformers encapsulate a document's meaning in a representative machine readable vector [35]; however, questions of model training and nuances in language remain.
The proposed approach using sentence-transformers was found to identify more similar documents, as judged by human annotators, than the baseline approach using TF*IDF. The better performance of sentence-transformers somewhat contradicts findings of similar studies where TF*IDF based vectorisation approaches have been found to outperform more recent word embedding approaches [cf. 45,46,47]. A potential explanation is the domain specificity of the initial Window Expeditions dataset and the target Geograph corpus. Since both are landscape relevant and thus the language is domain specific, sentence-transformer based vectorisation appears to encapsulate more underlying semantic information in the resulting vectors than the baseline TF*IDF approach. This is to be expected, given that TF*IDF is based on a bag-of-words approach while sentence-transformers use large pre-trained language models.
The past three decades have seen a shift from top-down expert based landscape perception research to bottom-up approaches of landscape characterisation [48][49][50]. In line with contemporary frameworks, modern approaches incorporate participatory data generation efforts. Including the views, values and perceptions of a heterogeneous group of individuals in landscape perception and preference research is crucial to understanding various human-landscape interactions and has seen increased attention [cf. 7, 21, 51]. However, participatory data generation efforts can be time consuming and costly, limiting the spatial coverage of the underlying data. With increasing global internet access, an abundance of user generated content has been created and stored. These vast amounts of unstructured natural language potentially contain many documents highly relevant for certain scientific inquiries, in this case landscape perception. However, identifying potentially relevant documents in large text collections remains challenging. Landscape relevant documents have been extracted from large corpora using rule based filtering techniques [10]; however, using a small high quality corpus in combination with sentence-transformers has not been attempted. The results of this study show that the proposed workflow is indeed able to identify similar documents in corpora of the same domain (Geograph) as well as in large web corpora of a different domain (WikiHow). This strengthens the argument that sentence-transformers do indeed capture underlying semantics, even in specific domains, and that the proposed workflow can be used to create large corpora of landscape relevant data. Going beyond landscape perception research, the proposed workflow could potentially be used in a variety of projects. To further investigate the topics captured within the identified documents, we performed LDA topic modelling and compared the resulting clusters.
The results show that both the Window Expeditions and the Geograph corpora capture similar themes as reported by participants. The identified topics revolve around snow and weather, rural and natural landscapes and urban and residential areas. Interestingly, particularly salient landscape dimensions identified in the literature are mountains, water bodies and recreational areas [cf. 52,53]. The generated topic revolving around rural and natural landscapes captures these concepts, however, we also identified urban and residential areas to be particularly important, calling for further research on how the more immediate surroundings, the everyday lived landscapes, of participants are perceived. The results of the LDA topic modelling underline the potential of the proposed workflow, seeing that both the Window Expeditions as well as the identified similar Geograph documents capture similar topics and landscape dimensions.

Limitations
Using a small curated high quality collection of natural language texts, the proposed workflow is able to successfully identify similar documents in large corpora as a means of creating large domain and genre specific datasets. However, the approach is accompanied by three key limitations.
Firstly, the literature agrees that language and culture are intertwined at a fundamental level [23,54]. However, many studies on landscapes, and on cultural differences in general, are conducted in and written for an English speaking population. Since we tested the proposed workflow only on English, and given the limited availability of pre-trained sentence-transformer models in other languages, the proposed workflow may only achieve high quality results in English. The lack of multi-lingual investigations into landscapes calls for more cross-cultural and cross-linguistic explorations of language and landscape. Furthermore, since our approach uses user generated content as a basis, it is subject to the well known limitations of this form (e.g. participation inequality and potential biases in the demographics of contributors) [55]; thus our texts may not be representative of all ways in which landscapes are perceived, and are likely to be biased towards popular locations [56].
Secondly, the methodological approach leads to a high dimensionality of the TF*IDF and sentence-transformer derived vectors. Since the TF*IDF vectors are sparse, with most values being 0 (term not present in the respective Window Expeditions and Geograph document), the resulting cosine similarity calculations for TF*IDF are heavily biased towards 0. Using dimensionality reduction techniques such as t-SNE [57] could potentially reduce this bias, leading to better baseline results. In addition, Window Expeditions is not geographically limited in scope, and we refrain from using spatially explicit criteria in our similar document identification workflow. One approach might be to use geographically trained language models [cf. 58] as an additional criterion for judging similarity. Furthermore, truncating documents to 384 tokens for the calculation of the sentence-transformer derived vectors may also result in some loss of information.
A final major limitation of the proposed workflow to generate large domain specific natural language corpora is the black-box nature of many widely adopted machine learning NLP approaches. The resulting vectors have no direct human interpretability, and it is thus impossible to retrace the path from initial individual documents to the results. Slight changes to the underlying sentence-transformer model could compromise reproducibility. In addition, the data used to pre-train the sentence-transformers used within this study consist mostly of large web corpora and thus might not capture domain specific nuances. We therefore propose that future research should focus on building domain specific sentence-transformer models. In the case of landscape research, corpora generated through the proposed workflow could be used to train a new sentence-transformer model with landscape relevant natural language (e.g. LANDBert).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.