In this section, we describe our data collection process from a popular online Bitcoin forum. Thereafter, we elaborate on our methodological approach and techniques.
Our objective is to understand how Bitcoin has generated trust among its users despite being anonymous and devoid of any legal or institutional backing. In this context, discussion forums have played a crucial role in the growth of Bitcoin, as users engage in discussions and interactions to share knowledge and information. Among the various online discussion forums, “Bitcointalk.org” is one of the oldest and most popular, and it has a large user base. This forum was started by Satoshi Nakamoto for the purpose of communicating with other developers (Nakamoto 2009a, b). As of September 2019, the Bitcoin Forum had over 2.6 million members, who had written 52 million posts on 1.21 million topics (“Bitcoin Forum - Statistics Center” 2019; Table 2).
In addition to the BitcoinTalk forum, there are several other similar online forums (e.g., Cryptocurrency Talk, Bitcoin Garden, and Bitcoin Stack Exchange) that are considered useful venues focused on cryptocurrencies (Crytalker 2019). However, the BitcoinTalk forum outperforms these other forums in terms of the number of users, posts, and topics (“Bitcoin Garden Forum-Statistics Center” 2019; “Cryptocurrency Talk” 2019; “Bitcoin Stack Exchange” 2019).
We base our analysis on the Bitcoin-related discussion posts collected from the BitcoinTalk forum for two important reasons. First, compared to the other data sources mentioned in Table 3, online discussion platforms serve as a good alternative source of data, as discussions, interactions, opinions, and the flow of information can be accessed on an unprecedented scale. Second, discussion data are not conditioned by the study or experiment being conducted; rather, they are naturally generated by the users. This allows us to infer the technological attributes related to trust in Bitcoin from the users’ own words and statements, which were used to address their concerns or share information within the Bitcoin community.
To collect data from the discussion forum, we wrote a web scraping script using the Python package beautifulsoup.Footnote 1 As our objective is confined to Bitcoin, we limit our analysis to general discussions on Bitcoin covering the four subtopics of legal, press releases, meetups, and important announcements. We downloaded close to 1.97 million discussion posts written between March 1, 2012, and September 21, 2018. Our data included the original posts and replies to them, the dates of each post and reply, and metadata about the users who posted.Footnote 2
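The parsing step of such a scraper can be sketched as follows. This is a minimal illustration, not our actual script: the CSS class names (`post`, `author`, `date`, `body`) and the sample markup are hypothetical stand-ins for the forum's real HTML structure, and the fetching, pagination, and rate-limiting logic is omitted.

```python
# Minimal sketch of the HTML-parsing step of a forum scraper using
# beautifulsoup. The class names below are illustrative placeholders,
# not the actual Bitcointalk markup.
from bs4 import BeautifulSoup

def parse_posts(html):
    """Extract (author, date, body) tuples from one topic page."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for div in soup.find_all("div", class_="post"):
        author = div.find("span", class_="author").get_text(strip=True)
        date = div.find("span", class_="date").get_text(strip=True)
        body = div.find("div", class_="body").get_text(strip=True)
        posts.append((author, date, body))
    return posts

# A hypothetical page fragment for demonstration.
sample = """
<div class="post">
  <span class="author">satoshi</span>
  <span class="date">2012-03-01</span>
  <div class="body">Bitcoin is an implementation of b-money.</div>
</div>
"""
print(parse_posts(sample))
```

In a full scraper, this function would be applied to each topic page retrieved over HTTP, with the results written to disk alongside the post and user metadata.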
Our aim was to identify the technological attributes that have contributed to the creation of trust among Bitcoin users. The first step in this direction required us to identify the relevant posts. A naïve approach would have been to use a simple keyword search to retrieve trust-related posts. However, such an approach entails two fundamental issues. First, discussion posts are user-generated data, in which users are not obliged to adhere to grammatical form and correctness; further, users can use various combinations of words to mean trust. Second, in the context of large data, a keyword-based search (for example, a search for “trust”) can easily return tens of thousands of posts. Even with this large number of results, the search would fail to include posts containing words such as faith, belief, and reliability, which are concepts similar to trust. Furthermore, identifying relevant posts would require manually reading all posts, which would have been impractical given the volume of our data. Moreover, any kind of ordering of the posts would be difficult, as it would require the development of a consistent and reliable scoring method to measure the trust-related content in each post; any such method would need to overcome the variance among the persons scoring the posts.
To circumvent these issues, we relied on the vector representation of words and documents generated through the paragraph vector method, also known as doc2vec, proposed by Le and Mikolov (2014). After obtaining the vector representations, we computed semantic similarity to identify the closeness of posts and words to trust. Semantic similarity is a metric defined over the corpus, where similarity is based on the likeness of semantic meaning instead of syntactical representation. In terms of learning semantic similarities, doc2vec methods have demonstrated superior performance to competing methods, such as bag of words (Dai et al. 2015; Le and Mikolov 2014). The paragraph vector method is rooted in the use of a neural network that predicts the words occurring near a given word. In this neural network-based method, a vector of weights is trained to maximize the prediction of the nearest word for a word in a given context. As in a classification problem, the model learns the network weights in order to maximize the prediction of the nearest word. However, unlike in a classification problem, the network outputs the learned weights as a vector, a semantic representation of the text, rather than the final prediction of the model. These vectors, known as word embeddings, are considered a good representation of the text, as they capture the semantic similarities learned through the nearest-word prediction task. Mikolov et al. (2013) reported state-of-the-art performance in learning semantic similarities and relationships. For instance, the word vector method could produce a relationship such as “King – Man + Woman = Queen” (Mikolov et al. 2013). The semantic meaning of the word “king” refers to leadership and also to the male gender. If one removes the semantic meaning of “Man” (male) from this word and adds the semantic meaning of the word “Woman” (female), one is semantically left with the concepts of leadership and female.
This combination of female and leadership is represented in word form as “Queen”. Later, this idea was extended by Le and Mikolov (2014) as the paragraph vector, also known as doc2vec, which can learn semantic representations for both words and documents in the same vector space. Their approach involved important improvements, including the ability to retain word order, and also allowed texts of variable length.
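The analogy above can be made concrete with a toy example. The two-dimensional, hand-crafted vectors below (one dimension for gender, one for royalty) are purely illustrative; real embeddings are learned, high-dimensional, and satisfy such analogies only approximately.

```python
# Toy illustration of the "King - Man + Woman = Queen" analogy with
# hand-crafted 2-d vectors. Dimensions: [gender (+1 male / -1 female), royalty].
import math

vectors = {
    "king":  [1.0, 1.0],    # male, royal
    "man":   [1.0, 0.0],    # male, not royal
    "woman": [-1.0, 0.0],   # female, not royal
    "queen": [-1.0, 1.0],   # female, royal
    "apple": [0.0, -1.0],   # unrelated filler word
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Compute king - man + woman component-wise.
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity.
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # "queen"
```

The same arithmetic, applied to learned 300-dimensional vectors, is what allows the model to place semantically related words near one another.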
To train the doc2vec model, we relied on the implementation provided by the Python package Gensim (Rehurek and Sojka 2010). Among the available models, we trained three different variants of the doc2vec model: paragraph vector with a distributed bag of words (PV-DBOW) with document vectors only; PV-DBOW in skip-gram mode, with word vectors trained alongside document vectors; and paragraph vector with distributed memory (PV-DM) using the sum of the context vectors. Due to the huge resource requirements involved in estimating PV-DM with concatenation, we omitted it from our potential model alternatives. In training all three doc2vec models, we set the same model parameters: vector size of 300 dimensions, context window size of 10, minimum word frequency of 5, and 50 epochs. Here, vector size refers to the dimension of the vector outputs for both word and document vectors. Similarly, context window size refers to the length (number of words) that is considered as a context: while learning the semantic relationships, our model considered 10 words at a time in a sliding fashion for each document. Given that posts could be of varying length, we considered a context window size of 10 to be a reasonable choice. Finally, an epoch refers to one training iteration over the corpus; we trained our models with 50 iterations, well above the common practice of 10. After training the models, we manually evaluated the word and document similarities for 20 randomly selected words. The results showed that the paragraph vector with a distributed bag of words, with documents and words trained together, performed better than the other models, making it our preferred choice.
Figure 3 summarizes the methodological approach used in the analysis of the discussion posts collected from Bitcointalk.org. Our methodology included cleaning and pre-processing the text, followed by training the doc2vec model. Once the model was trained and the preferred variant selected, we analyzed the data, building on two sets of output from the doc2vec model: 1) word vectors, the semantic representations of the words in the collection of posts; and 2) document vectors, the semantic representations of the actual posts. To analyze the data, we first extracted the fifty posts closest to the keywords related to the constructs of reliability, functionality, and helpfulness. Given a word as input, the doc2vec model can return similar documents or words together with a similarity score, computed using cosine similarity. Cosine similarity is computed over the vector representations of the given words, and the word or document vectors with the highest cosine similarity scores are returned. Since these vectors are learned from data exclusive to Bitcoin discussions, using many context windows containing the given words, we deemed this a suitable method to extract relevant posts.
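The retrieval step can be sketched in pure Python as follows. The post identifiers and three-dimensional vectors are illustrative stand-ins for the learned 300-dimensional embeddings; with a trained Gensim model, the equivalent operation is a `most_similar` query on the model's vector sets.

```python
# Pure-Python sketch of the retrieval step: rank document vectors by
# cosine similarity to a keyword's vector and return the top k.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def closest_posts(keyword_vec, doc_vecs, k=50):
    """Return ids of the k posts whose vectors are most similar to keyword_vec."""
    ranked = sorted(doc_vecs,
                    key=lambda d: cosine(keyword_vec, doc_vecs[d]),
                    reverse=True)
    return ranked[:k]

# Toy 3-d vectors standing in for learned 300-d embeddings.
doc_vecs = {
    "post_1": [0.9, 0.1, 0.0],
    "post_2": [0.1, 0.9, 0.2],
    "post_3": [0.8, 0.2, 0.1],
}
reliability = [1.0, 0.0, 0.0]  # stand-in vector for the keyword "reliability"
print(closest_posts(reliability, doc_vecs, k=2))  # ['post_1', 'post_3']
```

In the study, this query was issued once per construct-related keyword, with k = 50, to assemble the candidate posts for manual reading.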
Once the posts were collected, we read through them to identify the technological attributes. Moreover, to understand the semantic proximities between the words, we first computed their pairwise similarities and then visualized them as a network of words. The pairwise similarity matrix gives a similarity score for each pair of words, whereas visualizing it as a network allows us to see the relative positions of the attributes when all the pairwise similarities are considered. These relative positions of the words, as nodes in the network, allow us to see the semantic proximities between the word vectors. We present our results, with descriptions of the identified attributes and example posts, in the following section.
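The construction of the pairwise similarity matrix and word network can be sketched as follows. The words, toy vectors, and the 0.9 edge threshold are illustrative assumptions, not values from the study; the resulting edge list could then be handed to a graph library such as networkx for layout and visualization.

```python
# Sketch of the pairwise-similarity network over attribute words, using
# toy 2-d vectors as stand-ins for the learned word vectors.
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

word_vecs = {  # illustrative stand-ins for learned word vectors
    "trust":       [0.9, 0.1],
    "reliability": [0.8, 0.3],
    "faith":       [0.7, 0.2],
    "mining":      [0.1, 0.9],
}

# Pairwise similarity matrix, keyed by word pair.
sim = {(a, b): cosine(word_vecs[a], word_vecs[b])
       for a, b in combinations(word_vecs, 2)}

# Keep only edges above an (assumed) threshold to form the word network.
edges = [pair for pair, s in sim.items() if s > 0.9]
print(edges)
```

In this toy network, the trust-related words form a tightly connected cluster while “mining” remains isolated, which is exactly the kind of relative positioning the visualization is meant to reveal.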