Abstract
Natural language has proven to be a valuable source of data for various scientific inquiries including landscape perception and preference research. However, large high quality landscape relevant corpora are scarce. We here propose and discuss a natural language processing workflow to identify landscape relevant documents in large collections of unstructured text. Using a small curated high quality collection of actively crowdsourced landscape descriptions we identify and extract similar documents from two different corpora (Geograph and WikiHow) using sentence-transformers and cosine similarity scores. We show that 1) sentence-transformers combined with cosine similarity calculations successfully identify similar documents in both Geograph and WikiHow, effectively opening the door to the creation of new landscape specific corpora, 2) the proposed sentence-transformer approach outperforms traditional Term Frequency - Inverse Document Frequency based approaches and 3) the identified documents capture similar topics when compared to the original high quality collection. The presented workflow is transferable to various scientific disciplines in need of domain specific natural language corpora as underlying data.
1 Introduction and Background
In geography and the spatial humanities, research using unstructured natural language texts as a starting point for detailed inquiries has received increasing attention. Natural language texts have been identified as rich data sources in numerous spatial disciplines, ranging from disaster management [1, 2] and epidemiology [3, 4] to landscape perception research [5,6,7], and have been found to frequently contain implicit or explicit spatial information. Commonly, corpora of natural language serve as underlying datasets, from which relevant information is extracted using Natural Language Processing (NLP) and human annotation. An important limitation for such research is the availability of corpora relevant to specific research questions, also referred to as domain specific corpora. In contemporary research, different approaches in terms of underlying data sources are common:
1. Using very large general corpora (e.g. Wikipedia [8], Reddit [9])
2. Using more specific corpora with respect to domain and genre (e.g. Geograph [10]; micro-blogs and news articles [11])
3. Using social media data (e.g. Twitter [2], Hikr [12]), often collected using simple keyword search or other parameters (e.g. with respect to a location)
4. Generating corpora (e.g. TellUsWhere [13]; Window Expeditions [14]) for specific tasks
In this paper we leverage a small, high quality corpus of natural language landscape descriptions to identify similar domain specific documents in other corpora. Identifying relevant documents is a common topic of interest in information retrieval [cf. 15] and within this paper we address identifying documents for specific research questions. Building corpora with the identified documents is crucial in various natural language based studies, ranging from natural language generation systems [16] and improving the computational identification of irony and sarcasm [17] to term disambiguation [18]. However, approaches of building corpora by extracting relevant documents from existing text sources in the domain of landscape perception and preference research specifically, and with respect to answering geographic questions more generally, remain scarce, limiting the possibilities for natural language based inquiries into how landscapes are perceived and valued.
How we perceive landscapes has been found to influence our well-being and guide our behaviour [19, 20]. As such, compiling landscape specific corpora of natural language could further our understandings of how we perceive and interact with our surroundings, potentially guiding future policies and decisions. As with corpus based research in general, one of the most limiting factors in landscape perception and preference research is the availability of high quality datasets about our surroundings. Although sensor based datasets (e.g. temperature, humidity, spectral reflectance) are relatively easy to collect and are thus plentiful, experiential and perceptual datasets describing more qualitatively how individuals interact with and value landscapes remain scarce and are more difficult to collect [cf. 21].
Studies indicate the connection between natural language and landscape perception runs deep [22, 23], making natural language a valuable source of perceptual and experiential information about our environment [5, 6, 23, 24]. Analysing natural language has become commonplace in landscape perception and preference studies, ranging from identifying salient landscape characteristics through crowdsourced image descriptions in the Lake District [5] and identifying the most prominent environmental features using alpine year books [12], to extracting important landscape dimensions from short stories [24] and exploring fictive motion in hiking blogs [25]. Many of these approaches rely on making unstructured text machine readable, either through qualitative coding and human annotation or by means of natural language processing [26, 27]. However, they rely on having corpora available which contain relevant documents - the focus of this paper lies on effectively identifying such documents in large corpora.
The internet hosts an abundance of unstructured natural language, potentially containing landscape relevant information [cf. 21]. However, non-specific corpora commonly contain many irrelevant documents which need to be identified and removed. In order to ensure the landscape relevance of the underlying natural language data, moderating content achieves best results. However, moderation is only feasible in rather small collections of natural language due to the time needed to assess each document [28]. Moderating participatory data generation is thus limited to efforts resulting in small high quality datasets. The question thus arises: how can equally relevant documents in large collections of unstructured natural language be computationally identified? This requires that we numerically capture notions of document content, and one effective approach to approximating these semantics is vector representations [29, 30], which also allow for computational comparisons using cosine similarity scores [6, 29, 31]. Cosine similarity is a frequently used metric of similarity between multidimensional vectors and has been used in various natural language processing tasks ranging from clustering biomedical articles [32] to gauging landscape perception for small corpora [6]. Text vectorisation can take many forms, from simple binary vectors recording presence or absence of terms, through Term Frequency - Inverse Document Frequency (TF*IDF) weightings, to more recent approaches such as Word2Vec [33] and GloVe [34] which do not rely solely on term matching. A further recent addition to vectorisation methods are Bidirectional Encoder Representations from Transformers (BERT), which mark a significant improvement over previous approaches [35]. However, using BERT to translate a high quality actively crowdsourced landscape relevant corpus to machine readable vectors and identifying similar documents in large web corpora through cosine similarity calculations has not been attempted.
Having such a workflow would allow for the creation of larger datasets relevant for landscape perception and preference research.
The mentioned limitations, in combination with contemporary natural language processing and computational linguistics methods, have led to the formulation of the following research questions:
- RQ1: How can we use a small collection of high quality natural language landscape descriptions to identify relevant documents in large collections of text through natural language processing?
- RQ2: How do the identified documents compare in terms of quality and landscape relevance?
2 Data
To collect an initial high quality dataset, we developed and implemented an active crowdsourcing platform named Window Expeditions [cf. 36]. The platform allows interested individuals to upload representative in-situ natural language descriptions of their everyday lived landscapes in three languages, English, German and French. The contributed landscape descriptions are moderated to ensure a resulting high quality dataset. Within this study, we use the first 427 English natural language landscape descriptions contributed to Window Expeditions as our initial dataset.
Since we are interested in identifying similar documents in large corpora of unstructured text, especially in web content, we included two very different additional text collections (cf. Table 1). On the one hand we chose Geograph [cf. 10], an actively crowdsourced collection containing millions of representative landscape images and descriptions. On the other hand, we include a corpus from a completely different domain, WikiHow [cf. 37] containing common questions and answers to a variety of topics. We assumed that Geograph was much more likely to contain domain relevant descriptions to our task and included WikiHow as a control containing relatively short, often informally written conversational texts.
3 Methods
In order to identify landscape relevant natural language in large collections of unstructured text, we propose a workflow based on methods of annotation, natural language processing and sentence-transformers. All methods are combined to a semi-automated and scalable workflow (Figure 1) of which the specifics are presented in the following.
Our starting point was a small collection of curated, actively crowdsourced natural language landscape descriptions from the active crowdsourcing platform Window Expeditions, as well as a corpus of landscape image descriptions (Geograph) and a corpus of answers to common questions (WikiHow) (cf. Section 2). The documents in the source collection (Window Expeditions) as well as the additional corpora of unstructured text (Geograph and WikiHow) were translated to a vector space. We used HuggingFace's (Footnote 1) implementation of sentence-transformers, a machine learning technique for translating unstructured natural language to machine readable representations whilst retaining underlying semantic information [38]. HuggingFace's sentence-transformers are largely based on BERT [35] and incorporate the model all-mpnet-base-v2 (Footnote 2), which is based on microsoft/mpnet-base [39] and fine-tuned with 1 billion sentence pairs from a variety of data sources including Flickr [40], Yahoo Answers (Footnote 3), WikiAnswers (Footnote 4) and Reddit [41]. The presented sentence-transformers take a document as an input, truncate the document to 384 tokens and return a vector containing 768 signed decimal numbers.
In a further step, we calculated cosine similarity scores between all Window Expeditions document vectors and Geograph as well as WikiHow vectors. Cosine similarity scores are a popular measure of similarity between multi-dimensional vectors and are also used by HuggingFace to calculate document similarity. Since the vectors generated through sentence-transformers encapsulate underlying semantics, higher cosine similarity scores have been found to indicate higher convergence of underlying semantics between the represented texts [cf. 6, 31, 42].
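The pairwise comparison step above can be sketched as follows. The sketch assumes document vectors have already been produced (in the paper's workflow, by encoding each document with the all-mpnet-base-v2 sentence-transformer); here tiny 3-dimensional toy vectors stand in for the real 768-dimensional ones, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarities between rows of a and rows of b.

    Normalising each row to unit length reduces cosine similarity to a
    single matrix product."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy stand-ins for sentence-transformer document vectors
# (real vectors are 768-dimensional).
we_vectors = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0]])
geograph_vectors = np.array([[1.0, 0.0, 1.0],
                             [0.0, 1.0, 1.0]])

# scores[i, j] is the similarity between Window Expeditions
# document i and Geograph document j.
scores = cosine_similarity_matrix(we_vectors, geograph_vectors)
```

In practice the same computation is available as `util.cos_sim` in the sentence-transformers library; the explicit version above makes the normalisation step visible.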
In order to compare the results of the sentence-transformer approach to a purely lexical baseline, we repeated the steps above using a TF*IDF approach. We first cleaned the raw data by transforming text to lower case, eliminating stopwords and special characters and performing lemmatisation. We subsequently built document frequency dictionaries and calculated the TF*IDF value of each term in each document. Finally, using terms found more than twice in the Window Expeditions corpus (\(N = 796\)), we built document vectors in which the indices represent the identified terms and the values are the TF*IDF values of the respective terms. This resulted in a 796 dimensional vector for each document in each of the three corpora (Window Expeditions, Geograph, WikiHow). These vectors were used to calculate cosine similarity scores between all documents of all corpora.
In a final step we were interested in the topics and themes captured within our actively crowdsourced corpus of landscape descriptions and how these compared to salient topics in the corpora of identified similar documents. We performed Latent Dirichlet Allocation (LDA) topic modelling (number of topics = 10; iterations = 500) to identify clusters of terms belonging to different topics. We identified ten clusters as optimal by comparing coherence scores of the resulting models when specifying one to ten clusters. For each topical cluster we created a multi-dimensional vector representing the probabilities of terms to be contained within respective clusters. By calculating cosine similarity scores between all vectors representing Window Expeditions topic clusters and vectors representing Geograph topic clusters, we created a 10 x 10 matrix of cosine similarity scores and identified topics showing highest similarities. We present and discuss the three most similar topical clusters in further detail and visualise the terms and their probabilities with word clouds.
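The topic comparison step can be sketched as below. Once both LDA models are fitted over a shared vocabulary, each topic is a term-probability vector, and topic pairs can be compared with the same cosine similarity machinery as documents. The toy distributions over a 4-term vocabulary are illustrative; the paper's models use 10 topics each, giving the 10 x 10 matrix described above.

```python
import numpy as np

def topic_similarity(topics_a, topics_b):
    """Cosine similarities between topic-term probability vectors of two
    LDA models fitted over a shared vocabulary."""
    a = topics_a / np.linalg.norm(topics_a, axis=1, keepdims=True)
    b = topics_b / np.linalg.norm(topics_b, axis=1, keepdims=True)
    return a @ b.T

# Toy topic-term distributions (each row sums to 1).
we_topics = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.1, 0.4, 0.4]])
geo_topics = np.array([[0.6, 0.2, 0.1, 0.1],
                       [0.1, 0.1, 0.5, 0.3]])

sim = topic_similarity(we_topics, geo_topics)
# Index of the most similar topic pair across the two models.
best = np.unravel_index(np.argmax(sim), sim.shape)
```

The highest-scoring cells of this matrix correspond to the most similar topical clusters, which are then inspected qualitatively via word clouds.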
4 Results
4.1 Similarity Judgement and Evaluation
To investigate the effects of using a sentence-transformer based approach as opposed to our baseline method TF*IDF, we compared the cosine similarity score distributions when using sentence-transformers and TF*IDF between Window Expeditions and Geograph as well as Window Expeditions and WikiHow (cf. Figure 2). The results of the sentence-transformer approach show much lower mean cosine similarity scores for WikiHow, demonstrating that this collection did indeed contain less relevant content than Geograph (cf. Table 2). In addition, the baseline TF*IDF approach was found to have lower mean cosine similarity scores for both collections.
To evaluate the performance of both the sentence-transformer based approach and the baseline TF*IDF approach, we extracted the Geograph documents showing highest cosine similarity scores for each Window Expeditions contribution for both the sentence-transformer and the baseline TF*IDF approach. From these, we then randomly selected 50 Window Expeditions documents, and designed two simple evaluation tasks. Firstly, for each of the selected Window Expeditions contributions we judged whether document A or B was more similar, where A and B were either the top-ranked Geograph document selected by the sentence-transformer approach or TF*IDF. Secondly, for each document A and B we judged their similarity (relevance) to the original Window Expeditions document on a ternary graded relevance scale. After initial discussions and some refinement of the annotation rules we reached Cohen’s Kappa values of 0.39 for the first comparative test, and 0.32 for the second (after reducing the scale to a binary one, since very few documents were judged to be very similar). These Cohen’s Kappa values are typically interpreted as being fair. Given the complexity of our task – judging that documents discussed similar landscape settings, themes or conditions – we deemed these values to be sufficient, and one author then annotated a further 300 documents in the same way.
The results of our evaluation firstly showed that for 350 comparisons, in 153 cases the most similar document returned was judged to be a better match when using the sentence-transformer approach than using the baseline TF*IDF approach. In 19 cases, the most-similar document found by the TF*IDF approach was a better match, while in 158 both were considered equally similar and in 20 neither were a good match. These results clearly demonstrate that over the collection as a whole, the sentence-transformer based approach’s performance was better than our simple baseline.
In Figure 3 we show the relationship between cosine similarity and our ternary relevance judgements using box-plots. There is a clear, and statistically significant relationship between documents judged as non-similar/similar and the distribution of cosine similarity values for documents identified using sentence-transformers using the Welch Two Sample t-test (similarity judgements 1 vs. 2: \(t = -4.1961\), \(df = 27.352\), \(p < 0.01\); similarity judgements 1 vs. 3: \(t = -5.1503\), \(df = 30.521\), \(p < 0.01\); similarity judgements 2 vs. 3: \(t = -2.3223\), \(df = 284.89\), \(p = 0.0209\)). By contrast, when comparing similarity judgements to cosine similarity values generated using TF*IDF we find no significant relationships (similarity judgements 1 vs. 2: \(t = -1.9499\), \(df = 233\), \(p = 0.0524\); similarity judgements 1 vs. 3: \(t = -0.2862\), \(df = 106.75\), \(p = 0.7753\); similarity judgements 2 vs. 3: \(t = -1.8486\), \(df = 124.73\), \(p = 0.0669\)). These results demonstrate that not only does our sentence-transformer based method perform better in identifying documents judged to be similar, but also that top-ranked documents identified by this method are more likely to be relevant if they have higher cosine similarities. By inspecting our box plots we suggest that a cosine-similarity threshold of 0.7 is an appropriate value at which documents are more likely to be judged similar. Since our ternary relevance judgements are for top-ranked documents, we can also use them to calculate P@1. For sentence-transformers P@1 was 0.93, for TF*IDF 0.69.
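The P@1 calculation reduces to a simple fraction: the proportion of queries whose top-ranked document was judged relevant. A minimal sketch follows; the mapping from the ternary scale to a binary relevant/non-relevant decision is an assumption here (grades 2 and 3 treated as relevant), since the paper does not spell it out.

```python
def precision_at_1(judgements, relevant_grades=(2, 3)):
    """P@1: fraction of queries whose top-ranked document was judged
    relevant.

    judgements: one grade per query on the ternary scale used in the
    paper (1 = not similar, 2 = somewhat similar, 3 = very similar).
    relevant_grades: grades counted as relevant (an assumed mapping).
    """
    return sum(1 for g in judgements if g in relevant_grades) / len(judgements)

# Toy judgements for five queries' top-ranked documents.
p_at_1 = precision_at_1([3, 2, 1, 2, 3])  # -> 0.8
```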
4.2 Exploring most Similar Documents
After investigating how our proposed approach of using sentence-transformers compares with a simple baseline, we explored the qualities of documents with high cosine similarity scores as calculated using sentence-transformers in more detail. In particular, we were interested in the number of documents found to have similarity scores of 0.7 and above, since this was suggested by Figure 3 to be an appropriate value with which to identify similar documents. The results show the proposed workflow could extract many more documents, both in terms of absolute numbers and percentages, in Geograph than in WikiHow (cf. Table 3). This further strengthens the argument that sentence-transformers and a small high-quality natural language landscape specific dataset can be used to identify rich landscape relevant documents in other collections.
An initial qualitative inspection of the most similar Geograph and WikiHow documents (according to the performed cosine similarity calculations with sentence-transformer vectorisation) showed the identified texts did indeed capture landscape relevant semantics for both Geograph and WikiHow. However, the style of writing is notably different with Geograph containing descriptive natural language and WikiHow containing explanatory natural language, which is to be expected when considering the domain and genre of respective corpora, with Geograph texts describing locations, while WikiHow provides answers to specific questions.
Examples of Identified Similar Geograph Documents:
1. When it come to autumn colour, beeches are the most colourful of our native species, retaining their leaves longer than most other species, and displaying a range of vivid gold, orange and russet until well into November. (cosine similarity = 0.85) (By Anne Burgess)
2. Snow, which I believe fell for about 15 hours the previous day and night, has been removed from the tops of the trees by strong winds, but it has adhered to the eastern side of trunks and branches, and lies thick on the ground. It’s been excellently scrunchy snow for snowmen and snowballs. This view is from where [5715067] was taken. (cosine similarity = 0.84) (By Derek Harper)
3. Beautiful sand, interesting surf, good rocky outcrops make this an excellent beach. With little wind there is still surf, with some westerlies this beach gets exciting. (cosine similarity = 0.83) (By Peter Church)
Examples of Identified Similar WikiHow Documents:
1. The sound of a car without a muffler chugging down the street is never welcome, so make sure you’re not the one causing noise pollution in your neighborhood and spring to get your car fixed. Keeping your car in good, quiet working order will be appreciated by everyone who lives near you. The same goes for your lawnmower and any other noisy equipment you might use outdoors. To have an even greater impact on noise in your area, consider walking or biking instead of using a car whenever possible. (cosine similarity = 0.74)
2. As urban development stretches into rural areas, the noise level increases. Construction sites, airports, train stations, and highways are all sources of loud noises that grate on the ears. If you know the sources of noise pollution in your area, you can do your best to avoid them or find ways to mitigate their negative effect. When you’re choosing a place to live, see if the residence is in a flight path or near a busy highway. During the day the sounds might not bother you, but at night they might prevent you from sleeping. (cosine similarity = 0.74)
3. Being in the sunshine and fresh air has proven health benefits, from easing depression to improving your outlook on life. Go for a walk, take some photographs, or simply sit on your porch to enjoy the benefits of fresh air. If you live somewhere too cold to go outside, consider getting a sunlamp to compensate for the lack of daylight. (cosine similarity = 0.70)
The results show that Geograph representative landscape image descriptions can be very similar to the contributions to Window Expeditions, addressing very similar themes about what people perceive in landscapes, and are sometimes written in similar styles. The identified WikiHow documents, despite the different style of writing, address similar themes to some of the Window Expeditions contributions, such as soundscapes and noise (a frequent topic found in in-situ landscape descriptions [7, 10]) or the general benefits of being outdoors. This initial qualitative inspection shows that the proposed workflow identifies documents rich in landscape information.
To further investigate the similarities between documents, we calculated the number of identified similar documents for each individual document in Window Expeditions (cf. Figure 4). The results show that most documents have few identified similar documents, while a small number have many. This suggests that a small number of documents (visible as steep increases in the cumulative sum in Figure 4) are particularly important in identifying similar documents in the other collection, hinting at particularly salient topics. In the following, examples of such documents are shown with the respective number of identified similar documents in the other collection.
Window Expeditions Documents with the Highest Number of Similar Geograph Documents
1. A view over open fields with boundaries of hedges and trees. Rising hills out of the low valley. (number of similar Geograph documents = 1895)
2. There are two roads, both full of cars. There is the church across the street, and power lines obscuring my view. I can see some various road signs, and a lot of tress, If I really focus I can see a parking lot for UNCC - South Deck I believe. (number of similar Geograph documents = 621)
3. I can see a small fragment of car park with streetlights. Beyond that is an area of grassland with a few scrubby trees. Then in the distance I can see a hillside with two wind turbines on the top of the hill. The land cover on this hillside is mainly agricultural land with a few scattered trees and linear hedgerows. (number of similar Geograph documents = 412)
Window Expeditions documents with the highest number of similar Geograph image descriptions often share terms such as “view”, “hill”, “tree” and “car” between one or all documents. This suggests that the Geograph collection hosts many image descriptions revolving around what contributors see, the topography of a given area and mentions of transportation and vehicles.
To further investigate the similarity between Window Expeditions and the identified similar documents, we performed LDA topic modelling. We only applied LDA topic modelling to the identified similar Geograph documents, given the very small number of similar WikiHow texts found. After comparing all topics in the two corpora using cosine similarity scores (Figure 5), we identified the three topical clusters showing the highest similarities between Window Expeditions and Geograph and explore these further.
The topic showing highest similarity between Window Expeditions and Geograph (cosine similarity score = 0.67) revolves around the general theme of snow and weather related phenomena. Particularly salient are the terms associated with winter and snow “snow”, “winter”, “cold” and “cover” as well as terms referring to other weather related phenomena including “sun” and “rain” (Figure 6). This suggests that weather and in particular snow are important topics captured both within the Window Expeditions corpus as well as within the identified similar Geograph documents.
A further salient topic found in both Window Expeditions and Geograph (cosine similarity score = 0.63) and identified through LDA topic modelling is the general theme of the countryside and rural areas (Figure 7). Prominent shared terms between the clusters include “hill” as well as “field” and additional terms relating to rural areas such as “horse”, “cattle”, “pasture” and “farmland” are found in one or the other cluster. This suggests that rural and more natural areas are important topics in both the original Window Expeditions as well as the newly generated corpora.
Finally, we find the topic of urban and residential areas salient in both Window Expeditions as well as Geograph (cosine similarity score = 0.66). Frequent terms reflect common elements found in everyday lived landscapes and include “house”, “garden”, “home”, “road”, and “tree” (Figure 8). These suggest that both corpora capture natural and rural as well as urban and residential landscapes and highlights the fact that both corpora capture similar semantics regarding landscapes.
5 Discussion
The results of generating a large landscape relevant corpus of natural language using a small curated high quality dataset point towards a number of interesting observations which we discuss in more detail below. Specifically we discuss the properties of the generated corpora and we explore the topics that emerge from the data. Further, we present the limitations of this study and potential avenues of future work.
Using a high quality domain specific dataset - in this case actively crowdsourced in-situ natural language landscape descriptions - as a basis for identifying similar documents in other corpora through sentence-transformers was found to successfully generate new domain-specific corpora. Sentence-transformers encapsulate a document’s meaning into a representative machine readable vector [35], however, questions of model training and nuances in language remain.
The proposed approach using sentence-transformers was found to identify more similar documents, as judged by human annotators, compared to the baseline approach of using TF*IDF. The better performance of sentence-transformers somewhat contradicts findings of similar studies where TF*IDF based vectorisation approaches have been found to outperform more recent word embedding approaches [cf. 45, 46, 47]. A potential explanation is the domain specificity of the initial Window Expeditions dataset and the target Geograph corpus. Since both are landscape relevant and the language is thus domain specific, sentence-transformer based vectorisation appears to encapsulate more underlying semantic information in the resulting vectors than the baseline TF*IDF approach. This is to be expected, given that TF*IDF is based on a bag-of-words approach while sentence-transformers use large pre-trained language models.
The past three decades have seen a shift from top-down expert based landscape perception research to bottom-up approaches of landscape characterisations [48,49,50]. In line with contemporary frameworks, modern approaches incorporate participatory data generation efforts. Including the views, values and perceptions of a heterogeneous group of individuals in landscape perception and preference research is crucial to understanding various human-landscape interactions and has seen increased attention [cf. 7, 21, 51]. However, participatory data generation efforts can be time consuming and costly, limiting spatial coverage of the underlying data.
With increasing global internet access, an abundance of user generated content has been created and stored. These vast amounts of unstructured natural language potentially contain many documents highly relevant for certain scientific inquiries, in this case landscape perception. However, identifying potentially relevant documents in large text collections remains challenging. Landscape relevant documents have been extracted from large corpora using rule based filtering techniques [10]; however, using a small high quality corpus in combination with sentence-transformers has not been attempted. The results of this study show that the proposed workflow is indeed able to identify similar documents in corpora of the same domain (Geograph) as well as in large web corpora of a different domain (WikiHow). This strengthens the argument that sentence-transformers do indeed capture underlying semantics, even in specific domains, and that the proposed workflow can be used to create large corpora of landscape relevant data. Going beyond landscape perception research, the proposed workflow could potentially be used in a variety of projects.
To further investigate the topics captured within the identified documents, we performed Latent Dirichlet Allocation topic modelling and compared the resulting clusters. The results show that both the Window Expeditions and the Geograph corpora capture similar themes as reported by participants. The identified topics revolve around snow and weather, rural and natural landscapes, and urban and residential areas. Interestingly, particularly salient landscape dimensions identified in the literature are mountains, water bodies and recreational areas [cf. 52, 53]. The generated topic revolving around rural and natural landscapes captures these concepts; however, we also identified urban and residential areas to be particularly important, calling for further research on how the more immediate surroundings, the everyday lived landscapes, of participants are perceived. The results of the LDA topic modelling underline the potential of the proposed workflow, seeing that both the Window Expeditions as well as the identified similar Geograph documents capture similar topics and landscape dimensions.
5.1 Limitations
Using a small curated high quality collection of natural language texts, the proposed workflow is able to successfully identify similar documents in large corpora as a means of creating large domain and genre specific datasets. However, the approach is accompanied by three key limitations.
Firstly, the literature agrees that language and culture are intertwined at a fundamental level [23, 54]. However, many studies on landscapes and on cultural differences in general are conducted in and written for an English speaking population. Since we tested the proposed workflow only on English texts, and given the limited availability of pre-trained sentence-transformer models in other languages, the proposed workflow may only achieve high quality results in English. The lack of multi-lingual investigations into landscapes calls for more cross-cultural and cross-linguistic explorations of language and landscape. Furthermore, since our approach uses user generated content as a basis, it is subject to the well known limitations of this form (e.g. participation inequality and potential biases in the demographics of contributors) [55], and thus our texts may not be representative of all ways in which landscapes are perceived, and are likely to be biased towards popular locations [56].
Secondly, the methodological approach leads to high dimensionality of both the TF*IDF and the sentence-transformer derived vectors. Since the TF*IDF vectors are sparse, with most values being 0 (i.e. a term is absent from the respective Window Expeditions or Geograph document), the resulting cosine similarity calculations for TF*IDF are heavily biased towards 0. Using dimensionality reduction techniques such as t-SNE [57] could potentially reduce this bias, leading to better baseline results. In addition, Window Expeditions is not geographically limited in scope, and we refrain from using spatially explicit criteria in our similar document identification workflow. One approach might be to use geographically trained language models [cf. 58] as an additional criterion for judging similarity. Furthermore, truncating documents to 384 tokens for the calculation of the sentence-transformer derived vectors may result in some loss of information.
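The sparsity bias described above can be demonstrated with a minimal example; the texts and the use of scikit-learn here are our own illustrative assumptions:

```python
# Minimal illustration of why sparse TF*IDF vectors bias cosine similarity
# towards 0: two documents sharing no terms are orthogonal and score exactly
# 0, even when their meaning is close. Texts are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = ["a calm lake surrounded by steep mountains"]
candidates = [
    "still water below high alpine peaks",   # paraphrase, no shared terms
    "a calm lake in the mountains at dawn",  # lexical overlap with the query
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(query + candidates)

# Row 0 is the query; compare it against both candidates.
sims = cosine_similarity(matrix[0], matrix[1:])[0]
print(sims)  # the paraphrase scores 0, the lexically overlapping text does not
```

A dense sentence-transformer embedding would instead assign the paraphrase a non-trivial similarity score, which is precisely the property the proposed workflow exploits.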
A final major limitation of the proposed workflow for generating large domain specific natural language corpora is the black-box nature of many widely adopted machine learning NLP approaches. The resulting vectors carry no directly human interpretable meaning, making it impossible to retrace how individual input documents contribute to the results. Slight changes to the underlying sentence-transformer model could undermine reproducibility. In addition, the data used to pre-train the sentence-transformers employed in this study are mostly large web corpora and thus might not capture domain specific nuances. We therefore propose that future research focus on building domain specific sentence-transformer models. In the case of landscape research, corpora generated through the proposed workflow could be used to train a new sentence-transformer model on landscape relevant natural language (e.g. LANDBert).
Notes
www.huggingface.co (accessed: 06.09.2022).
www.huggingface.co/sentence-transformers/all-mpnet-base-v2 (accessed: 06.09.2022).
www.kaggle.com/datasets/soumikrakshit/yahoo-answers-dataset (accessed: 06.09.2022).
www.github.com/afader/oqa#wikianswers-corpus (accessed: 06.09.2022).
References
Sit MA, Koylu C, Demir I (2019) Identifying disaster-related tweets and their semantic, spatial and temporal context using deep learning, natural language processing and spatial analysis: a case study of hurricane irma. Int J Digital Earth 12(11):1205–1229. https://doi.org/10.1080/17538947.2018.1563219
Zahra K, Imran M, Ostermann FO (2020) Automatic identification of eyewitness messages on twitter during disasters. Inform Process Manage 57(1):1–15. https://doi.org/10.1016/j.ipm.2019.102107
Klein AZ, Cai H, Weissenbacher D, Levine LD, Gonzalez-Hernandez G (2020) A natural language processing pipeline to advance the use of twitter data for digital epidemiology of adverse pregnancy outcomes. J Biomed Inform 112:1–9. https://doi.org/10.1016/j.yjbinx.2020.100076
Klein AZ, Magge A, O’Connor K, Flores Amaro JI, Weissenbacher D, Gonzalez Hernandez G (2021) Toward using twitter for tracking covid-19: A natural language processing pipeline and exploratory data set. J Med Internet Res 23(1):1–6. https://doi.org/10.2196/25314
Koblet O, Purves RS (2020) From online texts to landscape character assessment: collecting and analysing first-person landscape perception computationally. Landsc Urban Plann 197:1–16. https://doi.org/10.1016/j.landurbplan.2020.103757
Wartmann FM, Purves RS (2018) Investigating sense of place as a cultural ecosystem service in different landscapes through the lens of language. Landsc Urban Plann 175:169–183. https://doi.org/10.1016/j.landurbplan.2018.03.021
Wartmann FM, Koblet O, Purves RS (2021) Assessing experienced tranquillity through natural language processing and landscape ecology measures. Landsc Ecol 36(8):2347–2365. https://doi.org/10.1007/s10980-020-01181-8
Ardanuy MC, Sporleder C (2017) Toponym disambiguation in historical documents using semantic and geographic features. In: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage. DATeCH2017. Association for Computing Machinery, New York, NY, USA, pp. 175–180. https://doi.org/10.1145/3078081.3078099
Fox N, Graham LJ, Eigenbrod F, Bullock JM, Parks KE (2021) Reddit: A novel data source for cultural ecosystem service studies. Ecosyst Serv 50:1–14. https://doi.org/10.1016/j.ecoser.2021.101331
Chesnokova O, Purves RS (2018) From image descriptions to perceived sounds and sources in landscape: Analyzing aural experience through text. Appl Geogr 93:103–111. https://doi.org/10.1016/j.apgeog.2018.02.014
Do Y (2019) Valuating aesthetic benefits of cultural ecosystem services using conservation culturomics. Ecosyst Serv 36:1–5. https://doi.org/10.1016/j.ecoser.2019.100894
Derungs C, Purves RS (2016) Characterising landscape variation through spatial folksonomies. Appl Geograp 75:60–70. https://doi.org/10.1016/j.apgeog.2016.08.005
Richter D, Winter S, Richter K-F, Stirling L (2012) How people describe their place: Identifying predominant types of place descriptions. In: Proceedings of the 1st ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information. GEOCROWD. Association for Computing Machinery, New York, USA, pp. 30–37. https://doi.org/10.1145/2442952.2442959
Thibault M, Baer MF (2021) Urban gamification during lockdown and social isolation -from the teddy bear challenge to window expeditions. In: Bujić, M., Koivisto, J., Hamari, J. (eds.) Proceedings of the 5th International GamiFIN Conference, pp. 130–139
Benedetti F, Beneventano D, Bergamaschi S, Simonini G (2019) Computing inter-document similarity with context semantic analysis. Inform Syst 80:136–147. https://doi.org/10.1016/j.is.2018.02.009
Gardent C, Shimorina A, Narayan S, Perez-Beltrachini L (2017) Creating training corpora for NLG micro-planners. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, pp. 179–188. https://doi.org/10.18653/v1/P17-1017
Filatova E (2012) Irony and sarcasm: Corpus generation and analysis using crowdsourcing. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey, pp. 392–398
Saeed A, Nawab RMA, Stevenson M, Rayson P (2019) A sense annotated corpus for all-words urdu word sense disambiguation. ACM Trans Asian Low-Resour Lang Inf Process. https://doi.org/10.1145/3314940
Abraham A, Sommerhalder K, Abel T (2010) Landscape and well-being: A scoping study on the health-promoting impact of outdoor environments. Int J Pub Health 55(1):59–69. https://doi.org/10.1007/s00038-009-0069-z
Thompson CW (2011) Linking landscape and health: The recurring theme. Landsc Urban Plan 99(3–4):187–195. https://doi.org/10.1016/j.landurbplan.2010.10.006
Bubalo M, Zanten BTV, Verburg PH (2019) Landscape and Urban Planning Crowdsourcing geo-information on landscape perceptions and preferences : A review. Landsc Urban Plann 184:101–111. https://doi.org/10.1016/j.landurbplan.2019.01.001
Mark DM, Turk AG, Burenhult N, Stea D (eds) (2011) Landscape in language: Transdisciplinary perspectives
van Putten S, O’Meara C, Wartmann F, Yager J, Villette J, Mazzuca C, Bieling C, Burenhult N, Purves R, Majid A (2020) Conceptualisations of landscape differ across European languages. PLoS ONE 15(10):1–16. https://doi.org/10.1371/journal.pone.0239858
Bieling C (2014) Cultural ecosystem services as revealed through short stories from residents of the Swabian Alb (Germany). Ecosyst Serv 8:207–215. https://doi.org/10.1016/j.ecoser.2014.04.002
Egorova E, Tenbrink T, Purves RS (2018) Fictive motion in the context of mountaineering. Spat Cogn Comput 18(4):259–284. https://doi.org/10.1080/13875868.2018.1431646
Hsieh HF, Shannon SE (2005) Three approaches to qualitative content analysis. Qual Health Res 15(9):1277–1288. https://doi.org/10.1177/1049732305276687
Pustejovsky J, Stubbs A (2013) Natural Language Annotation for Machine Learning – A Guide to Corpus-building for Applications, pp. 1–343
Ghosh A, Kale S, McAfee P (2011) Who moderates the moderators? crowdsourcing abuse detection in user-generated content. In: Proceedings of the 12th ACM Conference on Electronic Commerce. Association for Computing Machinery, New York, USA, pp. 167–176. https://doi.org/10.1145/1993574.1993599
Yenicelik D, Schmidt F, Kilcher Y (2020) How does BERT capture semantics? A closer look at polysemous words, pp. 156–162. https://doi.org/10.18653/v1/2020.blackboxnlp-1.15
Ethayarajh K, Duvenaud D, Hirst G (2020) Towards understanding linear word analogies. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pp. 3253–3262 . https://doi.org/10.18653/v1/p19-1315
Li B, Han L (2013) Distance weighted cosine similarity measure for text classification. Lect Notes Comput Sci 8206:611–618. https://doi.org/10.1007/978-3-642-41278-3_74
Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, Biberstine JR, Schijvenaars B, Skupin A, Ma N, Börner K (2011) Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLoS ONE 6(3):1–11. https://doi.org/10.1371/journal.pone.0018029
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings, pp. 1–12. arXiv:1301.3781
Pennington J, Socher R, Manning CD (2014) GloVe: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp. 1532–1543. https://doi.org/10.3115/v1/D14-1162
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1(Mlm), pp. 4171–4186 arXiv:1810.04805
Baer MF, Purves RS (2022) Window expeditions: A playful approach to crowdsourcing natural language descriptions of everyday lived landscapes. Appl Geogr 148:1–15. https://doi.org/10.1016/j.apgeog.2022.102802
Koupaee M, Wang WY (2018) WikiHow: A Large Scale Text Summarization Dataset arXiv:1810.09305
Reimers N, Gurevych I (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
Song K, Tan X, Qin T, Lu J, Liu T-Y (2020) MPNet: Masked and Permuted Pre-training for Language Understanding. 34th Conference on Neural Information Processing Systems
Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist 2:67–78. https://doi.org/10.1162/tacl_a_00166
Henderson M, Budzianowski P, Casanueva I, Coope S, Gerz D, Kumar G, Mrkšić N, Spithourakis G, Su P-H, Vulić I, Wen T-H (2019) A repository of conversational datasets, pp. 1–10. arXiv:1904.06472. https://doi.org/10.18653/v1/w19-4101
Han J, Kamber M, Pei J (2012) Getting to know your data. In: Data Mining, pp. 39–82. https://doi.org/10.1016/b978-0-12-381479-1.00002-2
Ling RF (1974) Comparison of several algorithms for computing sample means and variances. J Am Stat Assoc 69(348):859–866. https://doi.org/10.1080/01621459.1974.10480219
Chan TF, Golub GH, Leveque RJ (1983) Statistical computing: Algorithms for computing the sample variance: Analysis and recommendations. Am Stat 37(3):242–247. https://doi.org/10.1080/00031305.1983.10483115
Sitikhu P, Pahi K, Thapa P, Shakya S (2019) A comparison of semantic similarity methods for maximum human interpretability. In: 2019 Artificial Intelligence for Transforming Business and Society (AITB), pp. 1–4. https://doi.org/10.1109/AITB48515.2019.8947433
Singh AK, Shashi M (2019) Vectorization of text documents for identifying unifiable news articles. Int J Adv Comput Sci Appl. https://doi.org/10.14569/IJACSA.2019.0100742
Marcińczuk M, Gniewkowski M, Walkowiak T, Bedkowski M (2021) Text document clustering: Wordnet vs. TF-IDF vs. word embeddings. In: Proceedings of the 11th Global Wordnet Conference, pp. 207–214. Global Wordnet Association, University of South Africa (UNISA)
Tudor C (2014) An Approach to Landscape Character Assessment. Natural England, pp. 1–56
Council of Europe (2000) European Landscape Convention. Florence, ETS No. 176. http://conventions.coe.int/Treaty/en/Treaties/Html/176.htm
Antrop M (2013) A brief history of landscape research. In: The Routledge companion to landscape studies, pp. 12–22
Derungs C, Purves RS (2014) From text to landscape: Locating, identifying and mapping the use of landscape features in a Swiss Alpine corpus. Int J Geogr Inf Sci 28(6):1272–1293. https://doi.org/10.1080/13658816.2013.772184
Wherrett JR (2000) Creating landscape preference models using internet survey techniques. Landsc Res 25(1):79–96. https://doi.org/10.1080/014263900113181
Fagerholm N, Martín-López B, Torralba M, Oteros-Rozas E, Lechner AM, Bieling C, Stahl Olafsson A, Albert C, Raymond CM, Garcia-Martin M, Gulsrud N, Plieninger T (2020) Perceived contributions of multifunctional landscapes to human well-being: Evidence from 13 European sites. People Nat 2(1):217–234. https://doi.org/10.1002/pan3.10067
Kramsch C (2014) Language and culture. AILA Rev 27(1):30–55
Li L, Goodchild MF, Xu B (2013) Spatial, temporal, and socioeconomic patterns in the use of twitter and flickr. Cartogr Geogr Inf Sci 40(2):61–77. https://doi.org/10.1080/15230406.2013.777139
Hartmann MC, Koblet O, Baer MF, Purves RS (2022) Automated motif identification: Analysing flickr images to identify popular viewpoints in europe’s protected areas. J Outdoor Recreat Tour 37:100479
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(86):2579–2605
Roller S, Speriosu M, Rallapalli S, Wing B, Baldridge J (2012) Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1500–1510
Acknowledgements
First and foremost, we would like to thank everyone who contributed to Window Expeditions; without you this study would not have been possible. We thank all participants in the Geograph project (www.geograph.org.uk) for their generosity in making data available under a CC BY-SA licence. In addition, we would like to thank the members of the Geocomputation Group for their help with the evaluation and similarity judgements. Finally, we would like to thank the anonymous reviewers for their valuable inputs and comments, which improved this paper.
Funding
Open access funding provided by University of Zurich.
Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Baer, M.F., Purves, R.S. Identifying Landscape Relevant Natural Language using Actively Crowdsourced Landscape Descriptions and Sentence-Transformers. Künstl Intell 37, 55–67 (2023). https://doi.org/10.1007/s13218-022-00793-3