Computationally Efficient Labeling of Cancer Related Forum Posts by Non-Clinical Text Information Retrieval

An abundance of information about cancer exists online, but categorizing and extracting useful information from it is difficult. Almost all research within healthcare data processing is concerned with formal clinical data, but there is valuable information in non-clinical data too. The present study combines methods within distributed computing, text retrieval, clustering, and classification into a coherent and computationally efficient system that can clarify cancer patient trajectories based on non-clinical and freely available information. We produce a fully functional prototype that can retrieve, cluster and present information about cancer trajectories from non-clinical forum posts. We evaluate three clustering algorithms (MR-DBSCAN, DBSCAN, and HDBSCAN) and compare them in terms of Adjusted Rand Index and total run time as a function of the number of posts retrieved and the neighborhood radius. Clustering results show that neighborhood radius has the most significant impact on clustering performance. For small values, the data set is split accordingly, but high values produce a large number of possible partitions and searching for the best partition is therefore time-consuming. With a properly estimated radius, MR-DBSCAN can cluster 50000 forum posts in 46.1 seconds, compared to DBSCAN (143.4 seconds) and HDBSCAN (282.3 seconds). We conduct an interview with the Danish Cancer Society and present our software prototype. The organization sees potential in software that can democratize online information about cancer and foresees that such systems will be required in the future.


Introduction
The three predominating types of diseases in today's world are cancers, respiratory disorders, and cardiovascular diseases. These diseases entail a predictable trajectory for the patients, caretakers, and relatives, which can be summarized as [1,2]: Each of the four sequential steps is, however, complex and encompasses a range of concerns, for example life expectancy, patterns of decline, probable interactions with health related services, treatment plans, medical side effects, palliative care, and more. It is difficult but important to clarify and communicate these complex trajectories to the patient, caretakers, and the relatives. Better informed individuals have better treatment outcomes due to proper disease management, and democratized trajectories can lead to better clinical decisions being made, fewer side effects of treatments and fewer readmissions. Improving the overall care can be done by estimating, clarifying and communicating patient-specific disease trajectories.
Relevant but unused information about diseases and continuity of care is freely available in online communities and forum posts. This information could benefit cancer patients, relatives, or caretakers. Approximately one third of the world's population receives a cancer diagnosis during their lifetime [3]. This creates a large field of potential users that would like to learn more about their diagnosis from others. A cancer diagnosis leads to many different reactions, but most people tend to seek information online about the trajectory prognosis. A popular trend is to write and communicate on online forums about health issues [4,5]. On such forums, people write freely on a given topic. On cancer forums, people usually write about their frustrations, experiences, emotions, feelings, and personal preferences regarding any cancer related topic. The established health care systems do not leverage all of this non-clinical information. Some of it may be of clinical relevance, and some only of personal relevance. However, both kinds of information can empower patients, caretakers and relatives by, for example, strengthening their understanding of a cancer diagnosis, building self-confidence, and establishing online or physical communities.
The objective of the present study is to clarify and communicate cancer patient trajectories by information retrieval and subsequent clustering using three common techniques: MR-DBSCAN, DBSCAN, and HDBSCAN. The methods are evaluated on training sets of various sizes (5000 to 25000 posts) in terms of computational efficiency and Adjusted Rand Index. To our knowledge, this study is the first to perform text retrieval, clustering and classification of cancer patient trajectories in non-clinical texts. The end result is a software prototype that can sift, filter and present cancer information in a visually appealing manner, as we demonstrate with a graphical user interface. We also conducted an interview with the Danish Cancer Society [3], who saw a great potential in the presented software prototype and stressed the importance of patient empowerment. Currently, the digital adoption rate for elderly people in Europe and the US is generally low, but digital maturity is expected to increase in the coming years.

Related Work
Existing research with various objectives, methods and data backgrounds has addressed the idea of mining data for clarifying and estimating disease trajectories; e.g., natural language processing has a transformative potential within this area (e.g. [6][7][8]). In 2005, Murray et al. [2] carried out a clinical review that describes three typical disease trajectories, namely cancer, organ failure (heart and lung focus) and frail elderly (dementia focus), and in 2008 Meystre et al. [9] reviewed research within information extraction from clinical notes in narrative style. Studies have shown that even for data normally viewed as highly distinct, e.g., lab records, a portion of relevant information may only be available as part of clinical text [10]. In a 2010 study, Ebadollahi et al. [11] predicted patient trajectories from temporal physiological data, and in a 2014 study by Jensen et al. [12], disease observations across a span of fifteen years from a large patient population were translated into disease trajectories. In 2016 Ji et al. [13] developed prediction models for health condition trajectories and co-morbidity relationships based on social health records, and in 2017 Jensen et al. [1] conducted a text mining study on electronic health records in order to automatically identify cancer patient trajectories. In 2019 Assale et al. [14] reviewed and documented the potential of leveraging the unstructured content in electronic health records. In 2021 Nehme et al. [15] studied natural language processing in the domain of gastroenterology, primarily focusing on structured text within endoscopy, inflammatory bowel disease, pancreaticobiliary, and liver diseases.
None of the above-mentioned studies deal with text information retrieval, distributed clustering, and classification for identifying cancer patient trajectories from non-clinical texts, i.e. online forum posts.
Frunza et al. [16] did a related study in 2011; in their study, they automatically extract sentences from clinical papers about diseases and/or treatments. Based on the extracted sentences, semantic relations between diseases and associated treatments are then identified. Another related study was done by Rosario et al. [17] in 2004. The focus of their work was to recognize text-entities containing information about diseases and treatments. They use Hidden Markov Models and Maximum Entropy Models to perform the entity and disease-treatment relationship recognition. Compared to the Frunza et al. [16] and the Rosario et al. studies [17] that focus mostly on classification, the present study focuses also on text retrieval and clustering. Further, the present study focuses especially on cancer trajectories where the other studies have a broader perspective and aim to cover diseases in general.
In the 2011 study by Yang et al. [18], Density-Based Clustering was used to identify topics within online forum threads on social media. They also developed a visualization tool to provide an overview over the identified topics. The purpose of their tool was to extract topics with sensitive information related to terrorism or other crime activities; however, it might also be tailored to extract other topics. Besides using DBSCAN, the study proposed a related clustering method, namely SDC (Scalable Density based Clustering). The structure of the Yang et al. study is, to some extent, similar to the present study; specifically, in the present study, topics are also extracted from online forum posts, density based clustering is also used, and result visualization capabilities are also provided.

Reading guide
Section 2 presents the overall system architecture. Section 3 presents text information retrieval methods. Section 4 presents the clustering techniques used in this study. In section 5, we detail how further filtering of the clusters is performed as we split them into six categories (Cure, No cure, Disease, Treatment, Side effect and Irrelevant). The results and discussion are presented in sections 6 and 7, respectively.

2 System Architecture

Overview
The software solution developed in the present study consists of four components, including a database component for storage of clusters. The solution has been designed in a micro-service architecture with one process per component. Figure 1 provides a static overview in terms of a component diagram and figure 2 provides a dynamic overview in terms of an activity diagram.
The User interface (UI) component handles end-user interaction; this is detailed further in section 2.2. The purpose of the API component is primarily to enable easy access for the UI to the Database and the Service components. The Database persists all gathered forum posts and the computed results, e.g. clusters, classes and cancer-trajectories. The Service component handles the computationally burdensome data processing; the micro-service architecture enables scaling of this component only. By implementing the service component as a scalable unit, it becomes well-suited for the application of a distributed computing approach. The clustering calculations in particular are burdensome and need to be made efficient. Currently, the text retrieval and classification calculations do not need to be scaled as they are much faster than the clustering.

User Interface
Having a tool to visualize data is helpful for effective exploration of results. The developed user interface is useful for exploring the collected data set of forum posts and for showing information from an area of interest. For instance, a user is able to select a cluster, i.e. a cancer-type, of interest, e.g. breast cancer, and only receive posts within that particular cluster. In addition, a user can also choose a class-label, e.g. side effect, and thereby see all posts from the breast cancer cluster that contain information about side effects. Such a tool is relevant both for scientific use and for cancer patients and caretakers. The user interface consists of five main views, namely Search, Posts, Statistics, Clusters, and Tools (figure 1). Edited excerpts from the views Search, Posts and Statistics are seen in figures 3, 4 and 5 respectively. In the Search view, a user can search the entire collection of forum posts; the identified clusters, i.e. cancer-types, are displayed along with treatments mentioned in the posts. By clicking a cancer-type cluster, all posts associated with that particular cancer-type cluster are displayed in the Posts view. Users can browse through the posts within a cancer-type cluster, and by selecting a class-label, i.e. Disease, Treatment, Side effect, Cure, or No cure, only the posts within the cancer-type cluster and with the selected label are displayed. In the Statistics view, all cancer-type clusters are displayed along with their class-label distributions. Also, a histogram showing posts per cancer-type cluster is displayed along with absolute counts of posts, clustered posts, clusters, and class-labels. Thereby, the Statistics view provides a useful overview for the end-user; such an overview is very hard to obtain for any regular end-user reading through forum posts.

Validation
In this study, all developed software has been evaluated against five out of the eight properties defined in the software product quality model specified in the ISO 25010 standard [19]. Concretely, the validation has focused on the following properties and sub-properties:

Data Collection
The data consists of automatically collected posts from a set of publicly available cancer related forums; the posts are written by medical laypersons. Typically, the posts contain some combination of diagnoses, symptoms, experiences, questions, side-effects, treatments and/or treatment outcomes. In this study, each post is saved in the following self-explanatory structure:

[ thread id, author, title, date, content ]

The most interesting piece of information in each post is stored in the content attribute. This attribute contains all of a post's text, and based on this text, information retrieval is done and relevant features for the cancer-type clustering are extracted. Often, the post-texts contain rather detailed descriptions of a particular cancer-type and a received treatment. For example, a user on Cancer Survivors Network [20] wrote the following about thyroid cancer:

"Hi, everyone I was diagnosed with Papillary Thyroid Caner a little more than a year ago. It was right before Christmas last year and I went in to get all of my thyroid removed. It was a 5 hour surgery and I stayed in the hospital for 3 days because of complications with my levels. But, they didn't look at the lymph nodes and the last two ultrasounds I have had revealed lymph nodes in my neck and one in my throat that have gotten bigger. I need some advice for thos who have had their thyroid cancer come back. My treatment of thyrogen shots, lab work, and scans start April 1st, 2013. I know this cancer is the easiest to treat but I'm wondering what happens if it has in fact spread to my lymph nodes and in my body? Does that change the prognosis or treatment or staging? Thanks for the help!"

Although this representative example text is written by a layperson, it still contains relevant health and cancer related information that might be useful for other patients or caretakers.
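The five-attribute post structure above could be represented as follows. This is only an illustrative sketch: the class name `ForumPost`, the field types, and the example values are assumptions, since the study does not specify how posts are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForumPost:
    """One collected forum post (hypothetical representation)."""
    thread_id: str   # identifier of the forum thread the post belongs to
    author: str      # forum user name
    title: str       # thread or post title
    date: str        # posting date, kept as text here
    content: str     # full post text; the input to text retrieval


# Hypothetical example values, for illustration only.
post = ForumPost(
    thread_id="t-1",
    author="user123",
    title="Thyroid cancer coming back?",
    date="2013-03-01",
    content="I was diagnosed with Papillary Thyroid Cancer a year ago.",
)
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch; it prevents later processing steps from altering the collected text by accident.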

Text Retrieval Preprocessing
In order to perform the actual text information retrieval successfully, the text needs to be preprocessed. In this study, we have conducted three preprocessing steps: 1. cleansing, 2. stemming, and 3. tokenization.
In the cleansing part, unwanted characters, e.g. HTML tags, emojis and ASCII-artwork, are removed. This is a non-trivial task when dealing with forum posts as people express themselves quite informally.
In the stemming part, inflected and derived words are reduced to their word stem [21]. Several different algorithms for stemming exist, e.g. the Lovins Stemmer [22], the Paice Stemmer [23], and the predominant Porter Stemmer [24]. All of these stemming algorithms are best suited for English; in the present study, the Porter Stemmer is used. The Porter Stemming algorithm is based on five steps, and in each step, a specified set of rules is applied to the word being processed; for instance, table 1 shows the processing rules of the first step [24].

Table 1 Exemplification of some of the processing rules in the Porter Stemming algorithm (columns: Rules, Examples).

In the tokenization part, character and word sequences are sliced into tokens. Typically, the tokens are words or terms, but in this study, tokens are only words. After the tokenization, stop words are removed.
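The preprocessing steps above can be illustrated with a minimal Python sketch. Several simplifications are assumed here: only step 1a of the Porter algorithm (plural endings) is implemented rather than all five steps, the stop-word list is a small hypothetical one, and stemming is applied per token after stop-word removal for simplicity. All function names are illustrative.

```python
import re
from typing import List

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "it", "my", "of",
              "the", "to", "was", "with"}


def cleanse(text: str) -> str:
    """Remove HTML tags and non-alphabetic characters, then lower-case."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # strip digits, punctuation, emojis
    return text.lower()


def porter_step_1a(word: str) -> str:
    """Step 1a of the Porter stemmer (plural endings only)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


def preprocess(text: str) -> List[str]:
    """Cleanse, tokenize on whitespace, drop stop words, stem each token."""
    tokens = cleanse(text).split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [porter_step_1a(t) for t in tokens]
```

For example, `preprocess("<p>I was diagnosed with thyroid cancers :-)</p>")` yields the tokens `diagnosed`, `thyroid`, and `cancer`.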

Text Retrieval
For the subsequent clustering of posts into cancer-type clusters to be accurate, information from the content attribute of all posts needs to be retrieved. This is done by using text retrieval together with a predefined feature vector containing names of a range of cancer types.
In this study, we use the Term Weighting approach. This approach uses Term Frequency and Inverse Document Frequency to yield Term Frequency-Inverse Document Frequency, which is the final weight of a term.
The purpose of Term Frequency (tf) is to measure how often a term occurs in a specific document, i.e. in this study tf is simply an unadjusted count of term appearances.
Definition 1 Term Frequency [25]. tf(t, d) ≡ occurrences of term t in document d.
Clearly, documents vary in length which entails a bias in tf; that is, a term is likely to appear more often in a long document than in a short document, given the documents are similar in content [26]. Whenever a term is frequent in a document it is likely to be important to that specific document.
The purpose of Inverse Document Frequency (idf) is to measure the weight of a term in a collection of documents; a rare term is often more valuable than a frequent term in a collection of documents [27].
Where N is the number of documents in collection D, and n is the number of documents in D in which term t appears.
Term Frequency-Inverse Document Frequency (tf-idf) is a measure of how important a word is to a specific document in a collection of documents. A large tf-idf weight is obtained whenever: 1. the term frequency is high for the specific document, and 2. the document frequency is low for the term across the collection of documents. The combination of the tf and idf weights tends to filter out common terms that do not carry much information.
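As a minimal sketch of the term weighting described above, the following Python code computes tf as a raw count and multiplies it by idf. The logarithmic form idf = log(N / n) is an assumption: it is the commonly used definition and matches the symbols N and n above, but the exact variant used in the study is not reproduced here. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: unadjusted count of the term in one document."""
    return Counter(doc_tokens)[term]


def idf(term, docs):
    """Inverse document frequency, assumed here to be log(N / n)."""
    n = sum(1 for d in docs if term in d)   # documents containing the term
    if n == 0:
        return 0.0                          # term absent from the collection
    return math.log(len(docs) / n)


def tf_idf_vector(doc_tokens, docs, features):
    """tf-idf weight of every feature term for a single document."""
    return [tf(t, doc_tokens) * idf(t, docs) for t in features]


# Toy collection of preprocessed posts (hypothetical).
docs = [
    ["breast", "cancer", "chemo"],
    ["lung", "cancer"],
    ["breast", "breast", "surgery"],
]
features = ["breast", "lung", "cancer"]   # stand-in for the cancer-type feature vector
vector = tf_idf_vector(docs[2], docs, features)
```

In this toy collection, the third post receives a non-zero weight only for the feature "breast", since it mentions neither "lung" nor "cancer".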

DBSCAN Clustering
Clustering is the process of splitting an unlabeled data set into clusters of observations with similar traits such that intra-cluster variation is minimized and inter-cluster variation is maximized. A cluster in this study is a collection of similar posts in terms of cancer-type.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm based on the density of data points (also known as observations). It creates clusters from regions that have a sufficiently high density of data points and in doing so, it allows clusters of any shape even if the data contain noise or outliers. This allows DBSCAN to create non-convex and non-linearly separated clusters, contrary to many other clustering algorithms [29,30]. Also, the algorithm is able to find clusters of arbitrary size [31,32] and it does not require the number of clusters beforehand [31][32][33]. Moreover, DBSCAN (or variants thereof) is horizontally scalable such that efficient computing can be achieved via a distributed or cluster computing setup, which is attractive, and in some cases even necessary, for large scale text data processing. Before outlining the DBSCAN algorithm, a number of associated definitions need to be in place.
Definition 4 DBSCAN related definitions [33].
The ε-neighborhood of a point p is defined by the points within a radius ε of p.
If a point p's ε-neighborhood contains at least m_pts points, the point p is called a core point.
A point p is directly density-reachable from a point q if p is within the ε-neighborhood of q and q is a core point.
A point p is density-reachable from a point q with regard to ε and m_pts if there is a chain of points p_1, ..., p_n, where p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i.
A point p is density-connected to a point q with regard to ε and m_pts if there is a point o such that both p and q are density-reachable from o.
A point p is a border point if p's ε-neighborhood contains fewer than m_pts points and p is directly density-reachable from a core point.
All points not reachable from any other point are outliers called noise points.
A cluster C is a non-empty set that satisfies the following two conditions for all point pairs (p, q): 1. if p is in C and q is density-reachable from p, then q is also in C; and 2. if p and q are both in C, then p is density-connected to q.
To establish a cluster, DBSCAN starts with an arbitrary point p and finds all density-reachable points from p with respect to ε and m_pts. If p is a core point, a new cluster with p as a core point is created. If p is a border point, DBSCAN visits the next point in the data collection. DBSCAN may merge two clusters into one if the clusters are density-reachable [33]. The algorithm terminates when no new points can be added to any cluster [31].
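The procedure above can be sketched in a few lines of Python. This is a plain, brute-force illustration and not the implementation used in the study: it assumes Euclidean distance, counts a point as part of its own ε-neighborhood, and searches neighbors by scanning all points. The names `dbscan`, `region_query`, and `NOISE` are illustrative.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                          # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                 # may later become a border point
            continue
        cluster_id += 1                       # points[i] is a core point
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id        # noise point turns into a border point
            if labels[j] is not None:
                continue                      # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels


# Two dense groups and one isolated point (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

On this toy data the two groups of four points form two clusters and the isolated point is labeled as noise. The brute-force neighbor search makes the sketch quadratic in the number of points.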

MR-DBSCAN Clustering
Clustering with DBSCAN is computationally burdensome both with regard to run-time and memory consumption [34]. To achieve run-time efficiency, MR-DBSCAN (MapReduce-DBSCAN) [32,35] is used rather than DBSCAN. Besides distributing computations via MapReduce, the two clustering algorithms are equivalent. Figure 6 outlines the steps in MR-DBSCAN.

Partitioning
To maximize parallelism and thus the run-time efficiency gain, data must be well balanced such that data, and thus the computational work-load, can be evenly distributed on the compute-nodes. However, data in real life applications are often unbalanced and this needs to be addressed with a suitable data partitioning strategy; such a strategy is part of MR-DBSCAN.
One possible partitioning strategy is to recursively split the entire data set into smaller sets, i.e. partitions, until a stop criterion is met, e.g. all partitions contain less than a given number of points or a given number of partitions have been made. According to definition 4, the geometry of a cluster, and therefore sensibly also a partition, cannot be smaller than 2ε, so a partition that is to be split must extend at least 2 · 2ε. When splitting a partition into two in MR-DBSCAN [32], all possible splits are considered. The split that minimizes the cost in one of the sub-partitions is chosen. Here, cost is the difference between the number of points in sub-partition-1 and half of the number of points in sub-partition-2. Each partition is given a key and associated with a reducer.
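The idea of a balanced split can be sketched as follows. The sketch is deliberately simplified and rests on assumptions that are not taken from the study: it works along a single axis, it only tries cut positions on a grid of width 2ε so that both sides stay at least 2ε wide, and it interprets the cost as the absolute difference between the number of points on one side and half of the points in the partition being split. The function name `best_split` is hypothetical.

```python
def best_split(xs, eps):
    """Return the cut position that best balances both sides, or None.

    xs  : coordinates of the points along one axis
    eps : neighborhood radius; each side must be at least 2 * eps wide
    """
    lo, hi = min(xs), max(xs)
    min_width = 2 * eps
    best_cut, best_cost = None, None
    k = 1
    # Candidate cuts lie on a grid of width 2 * eps inside the partition.
    while lo + k * min_width <= hi - min_width:
        cut = lo + k * min_width
        left = sum(1 for x in xs if x < cut)    # points in the first sub-partition
        cost = abs(left - len(xs) / 2)          # assumed cost: imbalance of the split
        if best_cost is None or cost < best_cost:
            best_cut, best_cost = cut, cost
        k += 1
    return best_cut
```

Applying such a split recursively, until each partition holds few enough points or cannot be split further, gives a partitioning in the spirit of the strategy described above. A return value of `None` signals that the partition is narrower than 2 · 2ε.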

Local DBSCAN
A reducer is given a partition and all its associated data points. Therefore, the mapper must prepare all data related to a partition for every single partition. That is, for a partition P_i, the mapper prepares the data C_i within P_i, but also the data within P_i's ε-width extended partition R_i that overlaps the bordering partitions. In case of a 2D-grid partitioning those bordering sets are: North (N, IN), South (S, IS), East (E, IE), and West (W, IW) (figure 7).
Local DBSCAN uses the same principles as DBSCAN to perform its clustering. It starts with an arbitrary data point p ∈ C_i and finds all density-reachable points from p with respect to ε and m_pts. If p is a core point, the ε-neighborhood will be explored. If Local DBSCAN finds a point in the outer margin that is directly density-reachable from a point in the inner margin, it is added to the merge-candidate set. If a core point is located in the inner margin, it is also added to the merge-candidate set. Each clustered point is given a local cluster ID, which is generated from the partition ID and the label ID from the local clustering: (partitionID, localclusterID). The output of a reducer is the clustered data points and the merge-candidate set.
Fig. 7 Two MR-DBSCAN bordering partitions P_i and P_{i+1} along with P_i's extended partition with a blue outer margin and a green inner margin (inspired by [32]).
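The mapper's preparation of data for each partition, as described above, can be illustrated with a small sketch. It assumes 2D points and axis-aligned rectangular partitions given as (xmin, ymin, xmax, ymax); the function name `assign_to_partitions` and this rectangle representation are assumptions for illustration only.

```python
def assign_to_partitions(points, partitions, eps):
    """Map each partition ID to the points inside its eps-extended rectangle.

    A point close to a border is emitted for every partition whose extended
    rectangle contains it, so that neighbors across borders are visible to
    the local clustering of each partition.
    """
    assigned = {pid: [] for pid in partitions}
    for p in points:
        for pid, (xmin, ymin, xmax, ymax) in partitions.items():
            inside_x = xmin - eps <= p[0] <= xmax + eps
            inside_y = ymin - eps <= p[1] <= ymax + eps
            if inside_x and inside_y:
                assigned[pid].append(p)
    return assigned


# Two partitions that share the border x = 5 (toy data).
partitions = {0: (0, 0, 5, 10), 1: (5, 0, 10, 10)}
points = [(1, 1), (4.5, 5), (5.5, 5), (9, 9)]
assigned = assign_to_partitions(points, partitions, eps=1)
```

Here the two points near x = 5 are handed to both partitions, which is what later makes them merge candidates.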

Mapping Profile
After each partition has undergone clustering and merge candidate lists have been generated, the merge candidate lists are collected into a single merge candidate list. The basics of merging the clusters from the different partitions are: 1. Execute a nested loop on all points in the collected merge candidate lists to see if the same data points exist with different local cluster IDs; 2. If found, then merge the clusters. Figure 8 illustrates two examples of cluster-merge propositions. Example 1: the points d_1 ∈ C_1 and d_2 ∈ C_2 are core points and d_2 is directly density-reachable from d_1; thus, C_1 should merge with C_2. Example 2: the point d_3 ∈ C_1 is a core point and r ∈ C_2 is a border point; thus, C_1 should not merge with C_2.
Fig. 8 Two MR-DBSCAN bordering partitions S_i and S_{i+1} along with S_i's extended partition with a blue outer margin and a green inner margin. The points d_1, d_3 ∈ C_1 and d_2 ∈ C_2 are core points, and t ∈ C_1 and r, q ∈ C_2 are border points (inspired by [32]).
As seen in the Local DBSCAN step (subsection 4.2.2), the output of each Local DBSCAN is a merge-candidate list consisting of two types of points, namely: 1. the core points in the inner margin, and 2. directly density-reachable points in the outer margin. Clearly, this is suitable for the present Mapping Profile step, where the purpose is to create a profile that maps clusters that should be merged. The algorithm for generating the mapping profile is shown in figure 9.
Fig. 9 Generate merge mapping profile [32].
The output of the algorithm is a list of pairs of local clusters to be merged (denoted MP) and a list of border points (denoted BP); a point p is at least a border point in a merged cluster (this is taken care of in the next step).
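A possible shape of this step is sketched below. It is a simplified reading of the description above and not a transcription of the algorithm in [32]: merge candidates are assumed to be tuples (point ID, local cluster ID, is core point), candidates are grouped by point ID with a dictionary instead of a nested loop, and two local clusters are paired only when a shared point is a core point in at least one of them. The function name `mapping_profile` is hypothetical.

```python
from collections import defaultdict


def mapping_profile(candidates):
    """Return (merge pairs MP, border points BP) from merge candidates.

    candidates: iterable of (point_id, local_cluster_id, is_core)
    """
    by_point = defaultdict(list)
    for point_id, cluster_id, is_core in candidates:
        by_point[point_id].append((cluster_id, is_core))

    merge_pairs, border_points = set(), set()
    for point_id, entries in by_point.items():
        clusters = sorted({c for c, _ in entries})
        if len(clusters) < 2:
            continue                        # point seen in one local cluster only
        if any(is_core for _, is_core in entries):
            # A shared core point connects the clusters: record the merge.
            for a, b in zip(clusters, clusters[1:]):
                merge_pairs.add((a, b))
        else:
            border_points.add(point_id)     # shared border point: no merge
    return merge_pairs, border_points


# Local cluster IDs are (partitionID, localclusterID) pairs (toy data).
candidates = [
    ("d2", (0, 0), True), ("d2", (1, 0), False),   # core point shared by two clusters
    ("r", (0, 0), False), ("r", (1, 1), False),    # border point shared by two clusters
    ("x", (2, 0), True),                           # appears in one cluster only
]
mp, bp = mapping_profile(candidates)
```

In the toy data, the shared core point `d2` produces one merge pair, while the shared border point `r` is only recorded in the border point list.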

Merge
The previous step resulted in a list of pairs of clusters to be merged. The IDs of the local clusters should be changed into a unique global ID after merging. Thus, a global perspective of all local clusters is built (algorithm in figure 10). The algorithm generates the map: (partitionID, localclusterID) → globalclusterID. Lastly, as mentioned in the previous step, noise points are set to border points.
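One way to build the map (partitionID, localclusterID) → globalclusterID from the list of merge pairs is a union-find (disjoint-set) structure, sketched below. Using union-find is a choice made for this illustration; the algorithm in figure 10 may proceed differently. The function name `build_global_ids` is hypothetical.

```python
def build_global_ids(local_ids, merge_pairs):
    """Map every local cluster ID to a global cluster ID.

    local_ids   : list of (partitionID, localclusterID) tuples
    merge_pairs : pairs of local cluster IDs that belong to the same cluster
    """
    parent = {c: c for c in local_ids}

    def find(c):
        # Follow parent links to the representative, shortening the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in merge_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # union the two groups

    global_of_root, mapping = {}, {}
    for c in local_ids:
        root = find(c)
        if root not in global_of_root:
            global_of_root[root] = len(global_of_root)   # next free global ID
        mapping[c] = global_of_root[root]
    return mapping


local_ids = [(0, 0), (0, 1), (1, 0), (2, 0)]
merge_pairs = [((0, 0), (1, 0)), ((1, 0), (2, 0))]
global_ids = build_global_ids(local_ids, merge_pairs)
```

Chains of merges are resolved automatically: in the example, clusters (0, 0), (1, 0), and (2, 0) end up with the same global ID even though (0, 0) and (2, 0) are never paired directly.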

Classification
The result of the clustering is a set of cancer-type clusters. To enable further filtering possibilities for the end-user, a within-cluster classification is conducted such that each post within a cancer-type cluster is labeled with one of the six labels illustrated in table 2. This allows an end-user to filter the forum posts such that, for instance, only posts with breast cancer (cluster) treatments (class) are shown.
Table 2 Class labels for conducting within-cluster classification of forum posts. Each row gives the class label, the class description, and an example post.

Cure: About cancer curing treatments. Example: "After 16 chemo sessions my cancer was gone."
No cure: About cancer non-curing treatments. Example: "My husband went through chemo since he had bladder cancer. Sadly he passed."
Disease: About cancer without mentioning treatments. Example: "I was diagnosed with breast cancer a few weeks ago."
Treatment: About treatments without mentioning cancer. Example: "Has anyone tried being treated with Stem Cells?"
Side effect: Side effects of disease or treatment. Example: "The chemo makes it really hard for me to swallow and a hard time eating."
Irrelevant: About none of the above. Example: "I am so sorry to hear that. Love Lea."

We have chosen to classify with a Naive Bayes classifier trained with a manually created training set augmented with the freely available set from The BioText Project, UC Berkeley [36]. The Frunza et al. study also uses a Naive Bayes classifier with promising results [16]. However, they classified abstracts from scientific articles, which is a somewhat different data domain than the present study's non-clinical texts. The time complexity for training a Naive Bayes classifier is O(np), where n is the number of training observations and p is the number of features; thus, for a fixed number of features, the complexity is O(n) in the number of observations. When testing, Naive Bayes is also linear, which is optimal for a classifier.
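To make the classification step concrete, the following is a minimal sketch of a Naive Bayes text classifier. It assumes the multinomial variant with add-one (Laplace) smoothing over word counts, which the study does not specify, and it is trained on a handful of hypothetical token lists loosely based on the class labels above. The class name `NaiveBayes` and the training data are illustrative only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, docs, labels):
        # One pass over the training data: linear in the number of tokens.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)          # log prior
            for w in tokens:
                if w in self.vocab:                        # ignore unseen words
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set of preprocessed posts.
train_docs = [
    ["chemo", "cancer", "gone"],
    ["diagnosed", "breast", "cancer"],
    ["chemo", "hard", "swallow"],
    ["sorry", "love"],
]
train_labels = ["Cure", "Disease", "Side effect", "Irrelevant"]
model = NaiveBayes().fit(train_docs, train_labels)
```

With this toy model, `model.predict(["diagnosed", "cancer"])` returns "Disease" and `model.predict(["love"])` returns "Irrelevant". Working with log-probabilities avoids numerical underflow for long posts.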

Clustering: DBSCAN and MR-DBSCAN Verification
MR-DBSCAN is a distributed extension of DBSCAN and they use the same principles for clustering. Thus, given the same input, the two clustering methods should yield exactly the same output. The results in this section show that this is indeed the case and we thereby consider the implementations of MR-DBSCAN and DBSCAN to be verified in terms of correctness of logical output. The actual implementations do not share code, so it seems fair to disregard the odd risk of having both implementations wrong in a manner that leads to the same output.
For comparing the clusterings of DBSCAN and MR-DBSCAN, the Adjusted Rand Index (ARI) [37] is used. The index is a similarity measure between two clusterings; it is obtained by counting the pairs of points that are placed in the same cluster in both clusterings vs. the pairs that are placed in different clusters, corrected for chance agreement. If the label assignments coincide fully, the index is 1, and for unrelated (random) assignments, the index is close to 0; it can also become negative. If DBSCAN and MR-DBSCAN are implemented correctly, the ARI must be 1 regardless of: 1. the number of points in the data set, 2. the number of partitions in MR-DBSCAN, and 3. the parameter settings for ε and m_pts. In addition, the number of partitions (#P) in MR-DBSCAN, the coverage percentage (%C), and the number of labels (#L) in DBSCAN and MR-DBSCAN have been recorded. The results show (table 3) that the ARI is 1 in all 18 test cases; a necessary condition for this to happen is that both MR-DBSCAN and DBSCAN yield the same number of labels in all the tests, which is also the case (table 3).
Also, MR-DBSCAN partitioned its data into 3 to 8 partitions (table 3), which means that even though the data has been split and clustered individually per partition, the merging works as intended and yields the same clustering as DBSCAN. The coverage percentage value is also identical for the two clusterings in all test cases.
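For reference, the ARI used in this verification can be computed from the contingency counts of two label lists, as in the following sketch. It uses the standard pair-counting formula with chance correction; the function name `adjusted_rand_index` and the example labels are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two clusterings of the same points."""
    n = len(labels_a)
    # Pairs of points that share a cluster in both clusterings.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of points that share a cluster within each clustering.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)     # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0                            # degenerate case, e.g. one cluster each
    return (sum_ij - expected) / (max_index - expected)


# Same partition with different label names: full agreement.
same = adjusted_rand_index([0, 0, 1, 1, 2], [5, 5, 3, 3, 9])
# Two clusterings that disagree on every pair.
different = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index only depends on which points are grouped together, not on the label values, so two implementations that number their clusters differently can still reach an ARI of 1. The second example gives a negative value, which illustrates that the index is not bounded below by 0.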

Run-Time Analysis: MR-DBSCAN
The purpose of this experiment is to demonstrate the run-time of each of the MR-DBSCAN steps under variations in: 1. the number of forum posts, and 2. the neighborhood radius ε. Clearly, these two parameters have the largest influence on MR-DBSCAN's run-time. The ε parameter is used when partitioning the data set and therefore it has a direct influence on the beneficial effects of MapReduce.
In all tests, the lower point-count threshold for establishing a core point, m_pts, is fixed to 5 points. This is done as the parameter only has very little run-time influence and this influence is isolated to the DBSCAN step, i.e. it does not highlight run-time differences between DBSCAN and MR-DBSCAN.
For all 30 test cases (table 4), mapping takes almost no time at all; merging also has only little effect on run-time. For relatively large values of ε, i.e. 1 and 0.1, compared to the data span, MR-DBSCAN is not able to partition the data set well. Clearly, this affects the run-time as the clustering then is performed on a single partition (or very few) and no MapReduce improvements are achieved. For relatively small values of ε, i.e. 0.001 and 0.0005, the data set is split well into partitions, but due to the low value of ε there is a large number of possible partitions, and a lot of time is spent in search of the best partitioning. Thus, as the results show, the partitioning becomes slower when ε decreases, but the local DBSCAN becomes faster. Hence, ε needs to be set with care to strike a balance and minimize the total run-time of MR-DBSCAN. In our experiments, the balance is ε = 0.01 (table 4 and figure 11). At this point, the partitioning run-time is relatively low and likewise for the local DBSCAN; this results in a relatively low total run-time.
Decreasing ε even further to 0.0001 made the partitioning exceed the set maximum total run-time of 16 minutes (see the grayed-out rows in table 4). Entries are also missing (table 4 and figure 11) at 5000 posts and ε = 0.0005 and 0.001, as proper equidistributed partitioning cannot be done when both the number of posts and ε are relatively low.
The purpose of this experiment is to compare run-time as a function of number of forum posts of the three different clustering algorithms DBSCAN, MR-DBSCAN, and HDBSCAN [38]. Algorithm parameters are fixed and equal
MR-DBSCAN is slower than DBSCAN in the first test case with 10000 posts, but from this point on it executes much faster. When DBSCAN and HDBSCAN stopped executing due to memory exhaustion of the test computer, MR-DBSCAN continued; hence the gray cells in table 5 and the x-axis limit in figure 12. The memory exhaustion when running DBSCAN and HDBSCAN is mainly due to the growth of the tf-idf matrix, which holds a forum post per row. This is simply not a feasible implementation when clustering problems become large. Clearly, divide and conquer by MapReduce helps circumvent this problem.

Discussion
We argue that the information hidden in non-clinical texts is valuable and worth retrieving and activating. In the present study, the activation is done via a decision support system that helps cancer patients and caretakers to stay informed about cancer trajectories, i.e. associated symptoms, diagnoses, treatments, and outcomes, and to make informed arguments and decisions regarding treatment plans.
Concretely, the presented system analyzes non-clinical forum posts' contents by using text retrieval, clustering, and classification methods. The methods are executed in a distributed computing setup, specifically MapReduce, to achieve computational efficiency via utilization of multi-cores in modern computers. Indeed, the computationally burdensome clustering was significantly improved in terms of run-time by MapReduce; thus, the clustering method MR-DBSCAN is recommendable for large clustering challenges.
Moreover, the presented system provides an interactive graphical user interface that allows end-users to mine the valuable information and to get an overview over cancer trajectories. Hopefully, the proposed system and similar systems will also help build patient/caretaker communities by leveraging the soft information not hitherto used by the established health care systems, e.g. information about emotions, feelings, or personal preferences.
The present study can be extended in several different ways. Adding, refining, and benchmarking more clustering and classification methods would yield a more comprehensive comparison that might lead to even better results, i.e. more accurate clusterings and classifications, and thus, ultimately, a better end-user service. For the classification it would especially be of interest to collect and use a larger training set. With regard to DBSCAN and HDBSCAN clustering, we experienced memory exhaustion problems on our local development machines when executing the algorithms on large sets of posts, i.e. around 50000 posts. It is of interest to address this memory consumption challenge by redesigning the algorithms such that upper bounds on memory consumption can be guaranteed.
Lastly, it would be interesting to generalize the presented system such that it readily can be applied in other domains besides cancer; this would require an easy way of loading new data-sets and associated feature-vectors.