Introduction

Blockchain technology and its economics have attracted considerable attention from academic researchers. The total volume of research has increased dramatically, with the proportion of empirical studies growing gradually in recent years (Casino et al. 2019; Xu et al. 2019; Frizzo-Barker et al. 2020). Data availability is often a primary obstacle for empirical studies in emerging research areas such as blockchain, where it is unclear which alternative data sources can or should be used for quantitative analysis. By its nature, a blockchain primarily yields numerical data, such as users' on-chain transactions, network (value) metrics, trading activity, cryptoasset prices, and the financial reports of the few available companies, most of which are readily available on public blockchains. However, these datasets can be complemented by text data, which also extends coverage to consortium and private blockchains, thereby expanding the research span and yielding additional relevant insights.

Given the decentralized nature of the public blockchain ecosystem, there are few compulsory disclosures or official platforms that comprehensively represent individual blockchain projects and can serve as sources of blockchain-related information. Alternative sources of textual data therefore play a vital role for different parties in gathering information and making decisions within a blockchain network. For example, the sentiments of a crowd (via news, social media, or other text sources) may be a more relevant reference for investment in the blockchain ecosystem than in corporations. Such data can affect the market, influence investors’ decisions, and provide an impetus for blockchain development. Researchers can use texts in blockchain-related contexts to examine the data from more perspectives (i.e., explore not only the metadata describing the data but also its actual content) and draw inferences that could not be drawn from numbers alone.

Therefore, in this study, we focus on providing an overview of text analysis methodologies and data sources as they pertain to blockchains, which differ from the text-based analyses of corporations. There is no consensus on the type of text data that should or could be used to analyze a specific blockchain network or project; therefore, our systematic overview helps alleviate this concern.

Several types of blockchain-related text data are publicly available. First, blockchain is a frequent topic in news reporting, with subtopics including the performance of cryptocurrencies and the latest developments in the technology. Second, because of the technical nature of blockchain technology, online platforms and forums such as Twitter, GitHub, and Reddit have been actively used by different groups (e.g., investors and developers) to express their opinions and to share and track new developments (Mendoza-Tello et al. 2018). Blockchain startups also use social media for marketing. Third, blockchain project whitepapers provide key information (e.g., technical and marketing) to potential investors and are the primary method for understanding project details (Cohney et al. 2019).

In all these cases, manual examination of large-scale text content is exceptionally labor-intensive and time-consuming, if not impossible. Hence, computer-based text analysis is essential. Researchers across disciplines have provided guidelines for using such approaches. Grimmer and Stewart (2013), for example, illustrate the promise and pitfalls of text analysis for political science. Günther and Quandt (2016) give a comprehensive overview of text analysis methods useful in digital journalism research. Studies in economics and finance have addressed the advantages and disadvantages of different methodologies (Loughran and McDonald 2016; Cong et al. 2021; Gentzkow et al. 2019).

Such reviews have not been conducted in blockchain-related research areas, despite the close connection between blockchain technology and multiple text datasets. Therefore, we argue that it is necessary to use a transparent approach and an academic standpoint to synthesize the current knowledge in the literature to better understand the relevance and potential of text analysis. In this study, we conduct a systematic literature review by examining published and unpublished academic literature, focusing on text analysis associated with blockchain topics across disciplines. We provide the fundamental principles and relevant sources of text analysis methodologies and connect the relationships of research scopes, text data, and methodologies to provide researchers with a reference for choosing suitable combinations of the above elements with respect to their research question at hand. We then pinpoint the specific research topics studied in the literature and propose directions for future research. This review serves as a guide for researchers from different disciplines interested in conducting blockchain-related text analysis studies.

Research methodology

We conduct a systematic review of the academic literature on blockchain-related research using text analysis. Research in this area has expanded because of the rapid development of blockchain technology. However, because of the interdisciplinary nature of blockchain research, research perspectives vary starkly, posing difficulties in searching for and gathering knowledge beyond a single field. We focus on computer-based text analysis used in blockchain research to comprehend and synthesize studies across disciplines that utilize text analysis as a primary or ancillary methodology. We aim to gain knowledge from the existing literature in this area and discover future research opportunities. We adopt the guidelines of Siddaway et al. (2019) and the PRISMA statement (Liberati et al. 2009; Moher et al. 2009; Page et al. 2021a, b).

Definition of research questions

The first stage of a systematic review involves defining research questions that guide subsequent actions. We propose the following research questions to achieve the objectives of our review:

RQ1

Which research scope, text data, and methodology are used to conduct text analysis in the blockchain area?

Both blockchain and text analysis are broad concepts. This question is designed to identify the specific scope of the studies (e.g., cryptocurrency,Footnote 1 smart contractFootnote 2), the text data being analyzed (e.g., social media posts and news), and specific methodologies or techniques used to perform the analyses (e.g., sentiment analysis). We aim to bridge and highlight the connections between these elements in each study. This will assist researchers in selecting the appropriate data and methodologies for their research.

RQ2

What topics are addressed using text analysis in current literature?

The research questions determine how the research develops, and text analysis is one of the methods used to serve the purposes of a study. Regardless of whether text analysis is used alone or as part of a broader analysis, we intend to provide an interdisciplinary overview of the topics and research questions addressed in the existing literature, and illustrate how text analysis contributes to the study of these topics.

RQ3

What are the research gaps and promising future research topics?

Based on the findings of our review, we identify understudied areas and future research opportunities using text analysis in blockchain research. This allows researchers to recognize promising research topics and specify the methodologies (and data) they can use.

Literature search and selection

Initial keyword searches were conducted on May 24, 2022, followed by updated searches on August 23, 2022, to find relevant studies. We chose the Web of Science (WoS) and Scopus databases to cover publications indexed in academic databases. As text analysis in blockchain research is relatively new, some studies may not yet have been published. Therefore, we also performed a keyword search of the Social Science Research Network (SSRN) to identify unpublished papers (e.g., working and discussion papers) (Garanina et al. 2021). Subsequently, backward snowballing of the articles obtained through keyword searches was performed to identify additional articles.

For a comprehensive result, our query keywords encompassed not only blockchain and text analysis but also synonyms and multiple specific topics relevant to the area. Relevant words from blockchain included blockchain, cryptocurrency, stablecoin,Footnote 3 crypto token, smart contract, initial coin offering (ICO), security token offering (STO), initial exchange offering (IEO),Footnote 4 and non-fungible token (NFT).Footnote 5 Keywords from text analysis included text analysis, textual analysis, text analytics, topic modeling, natural language processing (NLP), word embedding, sentence embedding, bag of words, and sentiment analysis. We also used asterisks (*) and quotation marks (“”) to eliminate the impacts of plural forms, hyphens, or spelling variations. A description of our keyword-selection process and a complete list of keywords are included in the Appendix.

Keywords were searched in the title, abstract, and keywords.Footnote 6 The exact query is as follows:

(blockchain* OR cryptocurrenc* OR stablecoin* OR “crypto token*” OR “smart contract*” OR “initial coin offering*” OR “security token offering*” OR “initial exchange offering*” OR “non*fungible token*”) AND (“text* analysis” OR “text analytics” OR “topic model*” OR “natural language processing*” OR “word embedding*” OR “sentence embedding*” OR “bag of words” OR “sentiment analysis”)
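For researchers filtering locally retrieved records, the wildcard logic of this query can be approximated with regular expressions. The sketch below is a simplified illustration covering only a subset of the keywords; `matches_query` is a hypothetical helper, not part of either database's query language:

```python
import re

# '*' in the database query matches any word ending, e.g. cryptocurrenc*
# covers cryptocurrency/cryptocurrencies; non*fungible covers hyphen,
# space, and closed-up spelling variants.
BLOCKCHAIN_TERMS = re.compile(
    r"blockchain\w*|cryptocurrenc\w*|stablecoin\w*|crypto token\w*|"
    r"smart contract\w*|initial coin offering\w*|non.?fungible token\w*",
    re.IGNORECASE,
)
METHOD_TERMS = re.compile(
    r"text\w* analysis|topic model\w*|natural language processing|"
    r"word embedding\w*|bag of words|sentiment analysis",
    re.IGNORECASE,
)

def matches_query(text: str) -> bool:
    """A record is kept only if both term groups appear (the AND condition)."""
    return bool(BLOCKCHAIN_TERMS.search(text)) and bool(METHOD_TERMS.search(text))

print(matches_query("Sentiment analysis of cryptocurrencies on Twitter"))  # True
print(matches_query("A survey of smart contract security"))                # False
```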

The details of the literature search and selection process are presented in Fig. 1. Search queries in the two databases returned 517 records. First, we screened the metadata of the articles to remove those that were (1) non-English, (2) notes, editorials, conference proceedings titles, or preliminary papers, (3) duplicates, or (4) without full-text access. We then screened the titles and abstracts to remove articles based on our content-based exclusion criteria. To obtain relevant articles from multiple perspectives, we did not set inclusion/exclusion criteria by discipline. Instead, we checked the content of the articles and excluded an article only if (1) it did not contain information related to both blockchain and text analysis, (2) it focused purely on the technical aspects of blockchain, or (3) it did not specify the text analysis techniques used. After this screening, 140 articles remained for full-text assessment; we applied the exclusion criteria again and obtained 99 published articles. Our search on SSRN initially returned 30 articles. We removed 24 articles based on our exclusion criteria, leaving six unpublished articles. Subsequently, we conducted backward snowballing on the 105 articles included through the keyword searches (i.e., we went through their references) to find additional articles that did not appear in the keyword searches. This process yielded 19 additional papers. In total, 124 studies were included in the literature review.

Fig. 1
figure 1

The flowchart of the literature selection phases

Descriptive results

This section reports the descriptive results of the papers, including publication trends, keyword networks, and citation rankings.

Publication trend

Figure 2 depicts the number of papers per year by article type and research area. Although we did not set any timeframe restrictions in our keyword search, the first blockchain paper using text analysis appeared in 2015, six years after the birth of the Bitcoin blockchain (Nakamoto 2008). The total number of papers published annually has been increasing, indicating growing interest in and recognition of text analysis as a methodology for blockchain-related research. Until 2019, conference proceedings were the main channels through which related papers were published; from 2020 onward, however, the number of papers published in journals began to increase. For several years, computer science papers have largely dominated the topic, which can be explained by the coding skills required for many machine learning-based text analyses. Nevertheless, later years saw a growing number of papers from business-, economics-, and finance-related fields. Studies from other areas, such as the social sciences and multidisciplinary studies, have also contributed to this topic. The number of papers in most of these areas remains limited; however, the growing diversification of research areas indicates that interest has begun to spread from computer science to these fields.

Fig. 2
figure 2

The types and research areas of the publications in each year

We analyzed the network of papers’ keywords (see Fig. 3).Footnote 7 The size of the nodes reflects the frequency, the connection between the nodes indicates the co-occurrence of keywords in a paper, and the color of the nodes indicates the average year in which the keyword appears. The most common keywords are the three blockchain concepts: Bitcoin, cryptocurrency, and blockchain. Bitcoin had the earliest average occurrence and was associated with crime (e.g., crime, DarkNet market), social media (e.g., social networking, Twitter), and sentiment (e.g., opinion mining and sentiment analysis). Cryptocurrency is associated not only with crime but also with financial activities (e.g., financial services and investments), classification, and clustering (e.g., recurrent neural networks, deep learning, and topic modeling). The keyword blockchain tends to co-occur with specific applications (e.g., commerce and FinTech), topic modeling, and relationship analysis (e.g., network and trend analyses). Different keyword associations imply that the different scopes of topics within a blockchain are related to distinct economic activities and analyses. Individual text analysis-related keywords are mentioned less frequently; however, they appear in each blockchain scope. Sentiment analysis tends to go together with Bitcoin and cryptocurrency, whereas topic modeling and the corresponding keywords connect closely to cryptocurrency and blockchain.

Fig. 3
figure 3

The keyword frequency and co-occurrence networks

Citation ranking

Citation analysis helps identify the impact and common concerns of papers. However, one problem with using citations as an indicator of impact is that older papers have longer periods of citation accumulation. Thus, to offset this problem, we ranked the papers in terms of both total citations and citations per year (CPY) (Dumay and Cai 2014) and considered the top ten papers from both criteria. Table 1 lists these papers and summarizes their text data, sample period, text analysis techniques, and brief abstracts of the papers.
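The CPY adjustment is simple arithmetic: total citations divided by years since publication. A minimal sketch with invented numbers (not taken from Table 1) shows how the two rankings can diverge:

```python
# Illustrative records (title, publication year, total citations);
# the figures below are made up for demonstration only.
papers = [
    ("Paper A", 2015, 300),
    ("Paper B", 2020, 180),
    ("Paper C", 2018, 150),
]

CURRENT_YEAR = 2022  # year of the review's literature search

def cpy(pub_year: int, citations: int) -> float:
    """Citations per year: total citations over years since publication."""
    return citations / max(CURRENT_YEAR - pub_year, 1)

by_total = sorted(papers, key=lambda p: p[2], reverse=True)
by_cpy = sorted(papers, key=lambda p: cpy(p[1], p[2]), reverse=True)

print([p[0] for p in by_total])  # Paper A leads on raw citations
print([p[0] for p in by_cpy])    # Paper B overtakes it on CPY
```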

Nine papers appeared on both lists; one older paper (Georgoula et al. 2015) fell short on CPY and was surpassed by a newer paper (Kim et al. 2020). The topics of the high-impact papers concentrate within a narrow range. Ten studies applied sentiment analysis, and nine explored the predictive power of sentiment from social media platforms/news for cryptocurrency prices. Most studies focused on Bitcoin or a few altcoins with large market caps, while Kraaijeveld and de Smedt (2020) included nine cryptocurrencies, and Li et al. (2019) studied a smaller cryptocurrency called ZClassic (ZCL). One study examined the sentiments of blockchain-related tweets and found that blockchain's benefits were discussed more than its drawbacks (Grover et al. 2019). Kim et al. (2020) proposed a new topic modeling method and applied it to a literature review of blockchain research to discover research trends. A detailed discussion is provided in Table 1.

Table 1 The top 10 most cited papers by total citation and citation per year

Discussion of research questions

RQ1

Which research scope, text data, and methodology are used to conduct text analysis in the blockchain area?

In this section, we briefly introduce the scope, text data, and methodologies used in the papers and bridge the elements to identify the most used combinations. Figure 4 displays the connections among research scopes, text data, and methodologies in proportion to the number of papers.Footnote 8

Fig. 4
figure 4

The connections among research scope, text data, and the methodology

Research scope

‘Specific cryptocurrency’ (72 papers, 58%) is the most frequently used scope, and Bitcoin in particular is the most studied cryptocurrency. To better recognize the importance of Bitcoin, we separate studies that focus exclusively on Bitcoin (40 papers, 32%) from the others. Other studies examine cryptocurrencies with large market caps, particular small cryptocurrencies (Li et al. 2019; Mnif et al. 2021; Vacca et al. 2021), or a large number of cryptocurrencies to represent the market (Steinert and Herff 2018; Schwenkler and Zheng 2021).

Another substantial scope is the general concept of blockchain (26 studies, 21%). These studies treat blockchain technology and its applications as a whole and discover its uses in particular fields (e.g., supply chain management (Medhi 2020; Hirata et al. 2021; Xu and He 2022), banking (Daluwathumullagamage and Sims 2020), and accounting (Garanina et al. 2021)) and how blockchain-related topics evolve (over time) (Zhang et al. 2021a; Chousein et al. 2020; Medhi 2020; da Silva and Moro 2021; Zeng et al. 2018; Shahid and Jungpil 2020; Perdana et al. 2021).

The literature also covers the scope of the cryptocurrency market as a whole (11 papers, 8.9%) (Caliskan 2020; Siu et al. 2021), ICO projects (13 papers, 10.5%) (Toma and Cerchiello 2020; Liu et al. 2021; Sapkota and Grobys 2021), and smart contract (two papers, 1.6%) (Ibba et al. 2021; Zhang et al. 2021a).

It is worth noting that our search keywords also included stablecoin, NFT, and STO, but we found no papers that used text analysis to examine these scopes. This may result from the relatively recent emergence of these blockchain use cases. However, such applications have grown considerably in recent years (Lambert et al. 2021; Wang et al. 2021b), creating both opportunities and the need to address relevant research questions using text analysis.

Text data

Table 2 summarizes the text data and corresponding data sources we identify from the papers, which helps researchers navigate to the sources of their target data. We categorize texts into four groups: (1) corporate-produced documents, (2) user-generated content, (3) news, and (4) academic papers.

Table 2 The detailed information of text data sources used in the literature

Corporate-produced document Corporate-produced documents use formal and technical language to provide detailed information about a company or specific products and services. Despite the precise information these documents provide, we found only 18 studies that used such texts. An ICO whitepaper, which pitches the project idea and outlines the business plan, is a voluntary disclosure by the ICO project team to attract potential investors (Florysiak and Schandlbauer 2022; Thewissen et al. 2022). Another example of such a document is smart contract code. Although code is not strictly human language, its fixed format enables researchers to obtain information regarding the subject of the contract (Ibba et al. 2021; Zhang et al. 2021a). Blockchain-related texts can also be extracted from corporate documents, such as SEC and patent filings, through keyword searches and used to examine blockchain adoption (Yen and Wang 2021; Wang et al. 2021a; Zhang et al. 2021a; Stratopoulos et al. 2022).

User-generated content Among all text data, user-generated content was the most frequently used (85 times, 64%). This type of text features a shorter length and informal language, and generally expresses the opinions of users on a particular topic. Social media platforms offer rich resources for such texts (56 times, 42%). Specifically, most studies chose Twitter to extract text data for conducting the analyses (Patil et al. 2018; Huynh 2021; Mareddy and Gupta 2022), while others used Sina Weibo (a Chinese microblogging website) or Stocktwits (a social media platform focused on financial topics) (Chen et al. 2019a; Pan et al. 2020; Huang et al. 2021).

Compared with social media platforms, online forums often have a specific focus and attract users with shared interests; therefore, they tend to offer deeper discussions. Cryptocurrency-specific forums, such as bitcointalk, XRPChat, and the Ethereum Community Forum (Kim et al. 2016; Gurdgiev and O’Loughlin 2020), have sections with distinctive topics. User discussions on topic-focused forums, such as GitHub, Reddit, and StackExchange, have provided insights into the development of blockchain (Hinds-Charles et al. 2019; Bahamazava and Reznik 2022; Ortu et al. 2022). Reddit hosts numerous cryptocurrency-related communities (i.e., subreddits), such as r/CryptoMarkets and r/Bitcoin, which users can join to share up-to-date news or express their opinions on topics. In contrast, HackForums contains posts on illicit activities (Siu et al. 2021).

News News articles are among the most widespread and accessible types of textual data. They provide up-to-date factual information on events as well as commentaries/opinions on a topic. Analyzing blockchain news at scale allows researchers to trace the evolution of, and public sentiment toward, the technology. For instance, multiple news channels reported on the upcoming Ethereum Shanghai Hard Fork with different sentiments toward the event: FXStreet (2023) neutrally introduces the updates it would bring; U.Today (2023) illustrates multiple reasons for developers to be concerned about the hard fork; while Bloomberg (2023) is comparatively optimistic about it, emphasizing that “Shanghai is expected to push more people and institutional investors to stake their coins to support the Ethereum network and earn yield.”

Many studies use cryptocurrency-specific news channels (e.g., Coindesk and Cointelegraph) as their primary news data sources (Karalevicius et al. 2018; Farimani et al. 2022), whereas others search for blockchain-related news from financial newspapers (e.g., The Financial Times and The Economist) through keyword searches (Azqueta-Gavaldón 2020).

Academic paper Literature reviews assist researchers in understanding the current status of research, identifying research gaps, and guiding future research (Chakkarwar and Tamane 2019; Shahid and Jungpil 2020; Garanina et al. 2021). Unlike standard literature reviews, in which researchers manually examine papers, text-analysis-assisted literature reviews use automated processing to give researchers insights into a large number of papers in a specific area in a short time.

Methodology

Choosing a suitable methodology depends not only on the data characteristics but also on the research questions of the study. Our goal is not to provide a systematic classification of the methodologies, but to paint a broad picture of those used in the blockchain-related literature. Therefore, the methodologies presented in this section may overlap. For example, the underlying methodology of sentiment analysis can be a machine-learning-based classifier. This section outlines the principal methodologies most directly related to the research questions. In addition, we summarize the specific text analysis techniques used in the papers in Table 3 to provide supplementary details.Footnote 9

Table 3 The detailed information of text analysis techniques used in the literature

Text preprocessing Before conducting the actual analysis, multiple cleaning procedures should be applied to the raw text to prepare it as input material. The necessary steps vary depending on the text condition and the planned analysis. However, we identified standard preprocessing steps suitable for the majority of texts: removing special characters and punctuation, removing numbers and stopwords, lower-casing, spelling correction, tokenization, assigning part-of-speech tags, and stemming/lemmatization. Some raw texts require more cleaning than others. For example, texts from social media and online forums usually contain informal language and emojis, which can lead to misinterpretation. Several papers therefore conduct additional procedures (Birim and Sönmez 2022; Critien et al. 2022): removing # and @user tags, removing URL links, converting emojis to words, and expanding vocabulary abbreviations. These procedures remove redundant text, convert unrecognizable characters into valuable information, and are vital preparation steps.
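The cleaning steps described above can be sketched in a few lines. The following is a minimal, standard-library-only illustration; the stopword list and emoji map are tiny invented samples, and real pipelines typically rely on NLTK or spaCy for tokenization, part-of-speech tagging, and lemmatization:

```python
import re

STOPWORDS = {"the", "a", "is", "to", "and", "of"}   # tiny sample list
EMOJI_MAP = {"🚀": "rocket", "📈": "up"}             # tiny sample map

def preprocess(text: str) -> list[str]:
    text = re.sub(r"https?://\S+", " ", text)   # remove URL links
    text = re.sub(r"[@#]\w+", " ", text)        # strip @mentions and #hashtags
                                                # (some studies keep the hashtag word)
    for emoji, word in EMOJI_MAP.items():       # convert emojis to words
        text = text.replace(emoji, f" {word} ")
    text = text.lower()                         # lower-casing
    text = re.sub(r"[^a-z\s]", " ", text)       # drop numbers and punctuation
    tokens = text.split()                       # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("BTC to the moon 🚀 https://t.co/x #bitcoin @user 2021!"))
# -> ['btc', 'moon', 'rocket']
```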

Feature extraction The cleaned texts must be converted into numerical representations that the computer can read and use for further analyses. This conversion can also reduce computational complexity, enhance performance, and avoid overfitting, making it an essential procedure in text analysis (Kou et al. 2020). The representation itself can also provide information and insight. Count-based methods are straightforward to understand and interpret. Bag-of-words (BoW) is one of the most widely used approaches: it represents words by their frequency in the corpus, disregarding order and context. N-grams extend BoW by breaking the corpus into contiguous sequences of n words; they capture more context around each word but produce a sparser feature set than BoW. Both BoW and n-grams assume that more frequent words are more relevant, which does not always hold true. Term frequency-inverse document frequency (TF-IDF) (Salton et al. 1975) adds a measure of how rarely a word occurs across the entire corpus and assigns rarer words a higher score. Although such representations are generally used as inputs for further analysis, we identify papers that highlight frequent words and interpret them as blockchain topics (Zeng et al. 2018; Burnie and Yilmaz 2019; El-Masri and Hussain 2021). However, this practice can be misleading, because count-based methods discard linguistic structure and may miss crucial textual information.
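To make the count-based representations concrete, the following from-scratch sketch computes BoW counts and TF-IDF weights on an invented three-document corpus (in practice, libraries such as scikit-learn provide `CountVectorizer` and `TfidfVectorizer` for this):

```python
import math
from collections import Counter

# Invented toy corpus, already tokenized.
docs = [
    "bitcoin price rises".split(),
    "bitcoin price falls".split(),
    "ethereum merge nears".split(),
]

bow = [Counter(doc) for doc in docs]  # bag-of-words: per-document term counts

def tf_idf(term: str, doc_counts: Counter) -> float:
    tf = doc_counts[term] / sum(doc_counts.values())   # term frequency
    df = sum(1 for d in bow if term in d)              # documents containing term
    idf = math.log(len(docs) / df)                     # rarer terms score higher
    return tf * idf

# 'bitcoin' appears in two of three documents, 'rises' in only one,
# so 'rises' receives the higher weight in the first document.
print(tf_idf("bitcoin", bow[0]), tf_idf("rises", bow[0]))
```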

Word embedding mitigates this problem by representing words as vectors that capture their semantic and syntactic contexts in a document (Cong et al. 2021). In the vector space, the shorter the distance between two word vectors, the higher the similarity of the words. Word2vec (Mikolov et al. 2013) is one of the most frequently used word embedding methods. It includes two configurations: skip-gram and continuous bag of words (CBOW). Skip-gram uses the current word to predict the surrounding words, whereas CBOW predicts the current word from its surrounding words. Doc2vec (Le and Mikolov 2014), a generalization of word2vec, adds a document feature vector to the word vectors to capture the semantics of paragraphs and documents. Word embedding techniques are not frequently used in the literature, but we found that Kim et al. (2020) and Liu et al. (2021) integrated them when processing their texts. Two other word embedding models, GloVe and fastText, were used by Kilimci (2020).
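The distance intuition behind word embeddings can be illustrated with cosine similarity. The vectors below are invented three-dimensional toys; real word2vec embeddings are learned from a corpus and typically have 100 to 300 dimensions:

```python
import math

# Invented toy word vectors for illustration only.
vectors = {
    "bitcoin":  [0.90, 0.80, 0.10],
    "ethereum": [0.85, 0.75, 0.20],
    "banana":   [0.10, 0.20, 0.90],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: 1 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Semantically close words sit closer together in the vector space.
print(cosine(vectors["bitcoin"], vectors["ethereum"]))  # close to 1
print(cosine(vectors["bitcoin"], vectors["banana"]))    # much lower
```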

Analysis Sentiment analysis is the dominant text-analysis approach in the literature (80 times, 53%). There are two major types of sentiment analysis: lexicon/rule-based and machine learning-based (Vohra and Teraiya 2013).

Lexicon-based sentiment analysis calculates the sentiment score of a text from the polarity of each word (i.e., positive, negative, or neutral) using sentiment dictionaries in which each word is assigned a sentiment score. Well-established sentiment dictionaries include the Valence Aware Dictionary for Sentiment Reasoning (VADER) (Hutto and Gilbert 2014), which is particularly suitable for social media contexts, and the Loughran and McDonald sentiment lexicon (LM lexicon) (Loughran and McDonald 2011) in the finance domain. However, off-the-shelf dictionaries can generate inaccurate results because the same word may carry different sentiments in different contexts (Loughran and McDonald 2011). Therefore, some researchers have developed new or supplementary dictionaries (e.g., new vocabularies and emojis) for blockchain contexts to quantify sentiment more accurately (Chen et al. 2019a; Barth et al. 2020; Kraaijeveld and de Smedt 2020).
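A lexicon-based scorer reduces to a dictionary lookup plus aggregation. The sketch below uses a tiny invented lexicon with a naive negation rule; real dictionaries such as VADER or the LM lexicon are far larger and also handle intensifiers, punctuation, and emojis:

```python
# Toy lexicon: word -> polarity score (entries invented for illustration).
LEXICON = {"surge": 2.0, "gain": 1.5, "bullish": 2.0,
           "crash": -2.5, "scam": -3.0, "fear": -1.5}
NEGATIONS = {"not", "no", "never"}

def sentiment_score(tokens: list[str]) -> float:
    """Sum word polarities, flipping the sign of the word after a negation cue."""
    score, flip = 0.0, 1
    for tok in tokens:
        if tok in NEGATIONS:
            flip = -1
            continue
        score += flip * LEXICON.get(tok, 0.0)  # unknown words count as neutral
        flip = 1
    return score

print(sentiment_score("bitcoin surge bullish".split()))      # positive score
print(sentiment_score("not bullish fear of crash".split()))  # negative score
```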

Machine learning-based sentiment analysis adopts machine learning classifiers to learn the sentiments of texts and classify them into discrete sentiment groups. Researchers can build and train a model on their own data or apply a pre-trained model (e.g., Bidirectional Encoder Representations from Transformers (BERT)). Compared with lexicon/rule-based sentiment analysis, this approach is dynamic and can better fit the research context. We identified 12 papers that adopted it (e.g., Patil et al. 2018; Balfagih and Keselj 2019; Inamdar et al. 2019; Aslam et al. 2022). In particular, Han et al. (2020) and Akba et al. (2021) propose and assess new models for sentiment analysis.

Sentiment analysis tools have also been utilized in academic studies (Lu et al. 2017; Stanley 2019; Caviggioli et al. 2020; Moustafa et al. 2022). Such tools implement their own algorithms and reduce the programming burden on researchers. However, most are commercially oriented, incur high subscription fees, and lack transparency regarding their algorithms. Hence, despite the convenience, researchers should be cautious when using such tools.

In some studies, emotion-detection metrics have been applied in conjunction with sentiment analysis to achieve more precise emotion separation. For example, the NRC-VAD Emotion lexicon has three dimensions: valence, arousal, and dominance (Mohammad 2018). This provides another layer for sentiment and can increase the quality of the analysis.

Latent Dirichlet Allocation (LDA) and its variations were frequently chosen (33 times, 22%) for text analysis. LDA is a topic-modeling algorithm developed by Blei et al. (2003). Topic modeling can identify patterns of vocabulary and phrases in documents (within the corpus of interest), detect differences in their topics, and cluster documents by the topics they discuss. LDA, one of the most popular topic-modeling algorithms, assumes that each document in the corpus consists of a number of latent topics and that each topic is characterized by a word distribution; each topic is presented as a list of words with their associated probabilities. Its variations include dynamic topic models (DTM), which add temporal features to the model (Blei and Lafferty 2006), and SentLDA, which respects sentence boundaries and assumes that all words in a sentence are sampled from the same topic (Bao and Datta 2014). The texts used in LDA models are typically unlabeled, and the researcher's task is to choose the number of topics, guided primarily by perplexity and coherence scores (Blei et al. 2003; Newman et al. 2010) and refined by the researcher's own interpretation of the candidate models. Together with other topic-modeling and clustering algorithms, LDA belongs to unsupervised machine learning. Evaluation of unsupervised models varies from model to model, and human judgment is often required to assess model quality. Nevertheless, these models are valuable for exploring the underlying features of a text without an upfront framework (Grimmer and Stewart 2013), which makes them especially applicable to blockchain research, a field that is still understudied and has few established classifications.
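A minimal LDA run can be sketched with scikit-learn on an invented toy corpus (assuming scikit-learn is available; real studies fit models for several candidate topic numbers and compare their perplexity and coherence scores before interpreting the topics):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus: two documents about prices, two about smart contracts.
docs = [
    "bitcoin price surges as traders buy",
    "traders fear a bitcoin price crash",
    "smart contract bug exploited on ethereum",
    "ethereum smart contract audit finds bug",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)  # BoW input
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

doc_topics = lda.transform(X)   # per-document topic mixtures
print(doc_topics.shape)         # (4, 2); each row sums to 1
print(lda.perplexity(X))        # lower is better when comparing candidates
```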

In contrast, supervised machine-learning classifiers are applied to pre-labeled texts and classify them into pre-specified groups. The idea is to first manually categorize a set of documents and then train a supervised model that automatically learns how to assign categories using this training set (Bao and Datta 2014). Owing to the training process, such models are domain-specific and better fit the research context (Grimmer and Stewart 2013). Multiple models are often applied to the same dataset, and researchers can easily compare classifier performance using standard metrics (e.g., precision, recall, accuracy, F1-score) to select the best-fitting model. Nevertheless, in blockchain-related research, supervised classifiers are used much less often for text data (nine times, 6%).
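The supervised workflow, training on hand-labeled texts and then evaluating with precision, recall, and F1, can be sketched as follows (the toy posts and labels are invented; real studies use far larger training sets and compare several classifiers):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, f1_score

# Invented hand-labeled toy posts: 1 = positive sentiment, 0 = negative.
train_texts = ["great project huge gains", "bullish on this token",
               "total scam avoid", "rug pull lost everything",
               "amazing team strong roadmap", "fraud do not invest"]
train_labels = [1, 1, 0, 0, 1, 0]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

test_texts = ["huge gains ahead", "obvious scam"]
test_labels = [1, 0]
pred = clf.predict(vec.transform(test_texts))

# Standard evaluation metrics used to compare candidate classifiers.
print(precision_score(test_labels, pred),
      recall_score(test_labels, pred),
      f1_score(test_labels, pred))
```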

Bridging the elements

Figure 4 shows that the combinations of the elements are diversified depending on the purpose of the studies. Nevertheless, we observe two primarily adopted paths for text analysis in blockchain research: (a) papers studying specific cryptocurrencies tend to apply sentiment analysis to instant user-generated content or news articles to discover the correlations between public opinions/emotions and cryptocurrency market behavior, and (b) papers studying the broad concept of blockchain primarily choose official documents from companies (e.g., SEC and patent filings) and apply topic models to explore the classifications or trends in the sector.

The links among the above elements are not fixed; researchers can choose combinations according to their requirements. To select effective combinations, researchers must understand the characteristics of the data, the assumptions underlying a particular methodology, and the questions they intend to investigate. The design should facilitate the generation of interpretable and meaningful results that answer the research questions.

RQ2

What topics are addressed using text analysis in current literature?

The data and methodologies are used to serve the purpose of the study and should be chosen depending on the research questions (Grimmer and Stewart 2013). In the following section, we summarize blockchain-related topics discussed in the existing literature that involve text analyses.

Relationship discovery

Researchers have used different text data (often combined with other variables) to identify correlations. The speculative nature and high volatility of cryptocurrencies have led to studies exploring the relationship between market fluctuations and information on online platforms. Different factors of online discussions, including the counts of specific keywords, discussions of different topics, and sentiment classes, are extracted. These factors are used as variables to test whether they are associated with cryptocurrency market activities, such as price changes and the co-movement of peer cryptocurrencies (Polasik et al. 2015; Phillips and Gorse 2018; Barth et al. 2020; Schwenkler and Zheng 2021). From more specific perspectives, studies distinguish different user groups and vocabularies and find that content from certain groups or the presence of certain words is more closely related to changes in the cryptocurrency market (Burnie and Yilmaz 2019; Kang et al. 2020). Xie (2021) explores the relationships among online discussions and demonstrates that online communities’ conflicting opinions and redundant discussions result in low trading volumes.
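A minimal sketch of such a correlation test follows; the daily sentiment and return series are invented for illustration, and real studies typically also test statistical significance and lagged relationships:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented daily series: mean forum sentiment vs. same-day returns
sentiment = [0.2, -0.1, 0.4, 0.0, -0.3, 0.5]
returns = [0.01, -0.02, 0.03, 0.00, -0.01, 0.04]
r = pearson(sentiment, returns)
print(round(r, 2))
```

A value near +1 would indicate that days with more positive discussion coincide with higher returns; correlation alone, of course, says nothing about the direction of causality.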

An ICO whitepaper, which can be perceived as a far less regulated counterpart of an initial public offering (IPO) prospectus, provides information that can influence investors' decisions and, to some extent, determine the success of a project. Many dimensions of such texts influence ICO performance. For instance, ICO projects whose whitepapers display higher technological sophistication are more likely to be successful and less likely to be delisted (Liu et al. 2021). Whitepapers that are unique, that is, contain more project-specific information and avoid borrowing common phrases from previous whitepapers, can lead to higher fundraising amounts and better post-ICO performance (Yen and Wang 2021; Florysiak and Schandlbauer 2022). The readability and sentiment expressed in whitepapers can also affect investors' decisions to invest in the described project (Stanley 2019; Sapkota and Grobys 2021).

For public companies that meet higher disclosure standards, blockchain-related information can be extracted from 10-K filings and used to investigate whether blockchain adoption brings value and efficiency to companies (Yen et al. 2021).

Cryptocurrency performance prediction

Forecasting has always been an important topic in cryptocurrency studies. In addition to econometric methods and statistical models for price prediction, sentiment has also been used as a predictor of market movement (Mao et al. 2011; Fang et al. 2022). The effect of sentiment on the cryptocurrency market could be magnified by the lack of traditional financial fundamentals in valuation, and vocal and active investors on social media (Corbet et al. 2018; Gurdgiev and O’Loughlin 2020). Machine learning models, especially supervised models, are often applied to use sentiment data for prediction. Sentiment is used as the sole input to a model or as a supplement to conventional variables (e.g., price, trading volume, blockchain metadata (Sebastião and Godinho 2021)).

Texts from social media are extracted, and each document is assigned a sentiment score using a sentiment analysis technique (see Table 3 for details). The scores (along with other variables) are subsequently used as inputs for the prediction models. These scores have predictive power for the direction of price movement (Loginova et al. 2021; Critien et al. 2022) and for the short-term (e.g., hourly and daily) magnitude of price changes (Li et al. 2019; Farimani et al. 2022; Ortu et al. 2022).
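A simple lexicon-based scorer of the kind that produces such model inputs can be sketched as follows; the word lists are invented toys, whereas the cited studies rely on established lexicons such as VADER or LM:

```python
# Toy word lists, invented for illustration; real studies use
# established lexicons such as VADER or LM.
POSITIVE = {"bullish", "moon", "gain", "adoption", "rally"}
NEGATIVE = {"bearish", "crash", "scam", "hack", "dump"}

def sentiment_score(text):
    """Net sentiment in [-1, 1]: (pos - neg) / matched words."""
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

tweets = [
    "bitcoin rally continues huge gain today",
    "exchange hack triggers massive dump",
    "nothing new in the market",
]
scores = [sentiment_score(t) for t in tweets]
print(scores)  # [1.0, -1.0, 0.0]
```

Each document's score then becomes one feature, alongside prices and volumes, in the downstream prediction model.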

The impact of social media content depends particularly on the level of information dissemination. Thus, posts by celebrities or opinion leaders (i.e., influencers), or discussions about them, could have more power than other posts (Kang et al. 2020). Huynh (2021; 2022) quantifies the tweet sentiments of Donald Trump and Elon Musk using the LM lexicon and finds that negativity in Trump's tweets leads to higher returns on Bitcoin, whereas both pessimistic and optimistic expressions from Musk have a positive effect on Bitcoin returns. Cary (2021) analyzes the tweet sentiment about Elon Musk's performance on Saturday Night Live on 8 May 2021 and finds that the negative opinion toward his performance led to the price decline of Dogecoin.

Prediction models have also been used in ICO studies. Text data variables (e.g., expert reviews and social media sentiment) and non-text variables (e.g., sale price, project duration, and expert ratings) are utilized simultaneously to predict the success of ICO projects (Xu et al. 2021; Chursook et al. 2022).

Overall, studies focusing on predicting market movements and project success constitute a large proportion of the papers in this review. However, the data and methodologies mainly follow a similar direction: applying sentiment analysis to Twitter posts and associating the respective sentiment metrics with high market capitalization cryptocurrencies.

Classification and trend

One step in understanding large-scale text collections containing multiple documents is to categorize the documents and create classifications. Using clustering/topic models or classifiers, content features (i.e., the topics discussed) can be extracted and used to group documents into different classes. Adding a temporal dimension to the static classification reveals trends in a particular group of topics.

Such models can be valuable when applied to academic papers in literature reviews to facilitate an understanding of existing studies and identify further research. Unlike standard literature reviews, in which researchers read through papers to derive results, topic modeling-based literature reviews extract the titles and abstracts of papers and rely on algorithms to extract topics from the texts. Classification algorithms are used to understand the current state and development of blockchain research (Chakkarwar and Tamane 2019; Shahid and Jungpil 2020; Lee et al. 2022). Some studies have delved into blockchain applications within a sector (e.g., consumer trust, banking, and accounting) to help researchers and practitioners identify future research areas and business opportunities (da Silva and Moro 2021; Daluwathumullagamage and Sims 2021; Garanina et al. 2021). Although text analysis enables researchers to examine text content on a large scale without time-consuming manual reading, one drawback of using it for literature reviews is the lack of an information-screening process during which irrelevant papers are excluded from the review.

Most papers included in this review (Xu and He (2022) is an exception) directly use all papers from the keyword search results as their input for topic models and further analyses. In this case, many irrelevant papers may be erroneously included in the models and the noise information they contain can be significant, leading to biased or inaccurate conclusions. To avoid undermining the advantages of topic modeling, researchers must carefully design the selection criteria for their dataset when performing such studies.

At a more technical level, the classification and trends of blockchain infrastructure and application design problems have also been addressed. Using texts from technique-oriented platforms (e.g., GitHub and StackExchange), some studies have observed a shift in developers' interests from mining to software development (Alahi et al. 2019; Hinds-Charles et al. 2019). A special case involves using smart contract code as input for topic models or classifiers; researchers can thereby discover the most common uses of smart contracts and identify Ponzi schemes by analyzing the code (Ibba et al. 2021; Zhang et al. 2021b). Despite the focus on technical information, such studies benefit not only developers and computer scientists but also researchers in finance and economics by, for instance, identifying investor interests and customer demands.

The evolution of the blockchain topic is often tied to unique events that affect market activity and trigger changes in investor behavior. Linton et al. (2017), for example, study how blockchain topics change during periods of significant events in the cryptocurrency world, such as the insolvency of the MtGox Bitcoin exchange in 2014 (Goldstein and Tabuchi 2014) and the hack into Bitfinex in 2016 (Baldwin 2016) (e.g., from sole ‘Bitcoin trading’ topics to ‘security issues’ or ‘scams’ as predominant topics in online forums). Other researchers (Daluwathumullagamage and Sims 2020; Pan et al. 2020; Bahamazava and Nanda 2022) incorporate the influence of specific events (e.g., Bitcoin halving events, the introduction of regulations, and COVID-19) into their models to better interpret the change in interest during different periods.

Crime and regulation

Illegal activities and crimes have always surrounded discussions on cryptocurrency. Many early users praised the (pseudo)anonymity of cryptocurrency and used it as currency for illicit purchases on DarkNet. In the early stages, it was suggested that cryptocurrencies contributed to the growth of black markets (Foley et al. 2019).

Bahamazava and Reznik (2022) and Bahamazava and Nanda (2022) explore posts from Reddit (subreddit DarkNet) to study the evolution of criminal topics and the mainstream methods for trading cryptocurrencies illegally. Crime-related texts from other channels, such as Twitter, Telegram, and HackForums, are also used to identify the specific illegal activities discussed (Barth et al. 2020; Nizzoli et al. 2020; Siu et al. 2021). One rich first-hand source for examining fraud from the victim's side is the reports on https://www.bitcoinabuse.com, where victims of Bitcoin fraud share their experiences and post the original messages they received from the abusers. Choi et al. (2022) cluster these messages and find that a large number of them are highly similar, suggesting that fraudsters only slightly modify template messages and follow certain language patterns. Zhang et al. (2021b) apply an improved CatBoost classifier to smart contract code to find the common characteristics of Ponzi schemes hidden in the lines.
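The similarity clustering of fraud messages can be sketched with bag-of-words cosine similarity; the messages below are invented stand-ins for abuse reports, not actual data from the cited studies:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented messages mimicking slightly modified fraud templates
messages = [
    "send 0.5 btc to this address or your files are deleted",
    "send 1.0 btc to this address or your files are deleted",
    "congratulations you won a free nft claim it now",
]
bags = [Counter(m.split()) for m in messages]

sim_templates = cosine_similarity(bags[0], bags[1])  # near-duplicates
sim_unrelated = cosine_similarity(bags[0], bags[2])
print(sim_templates > 0.8, sim_unrelated < 0.2)  # True True
```

High pairwise similarity across many reports, as in the first pair, is exactly the pattern that points to a small set of reused fraud templates.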

Although studies inspecting illegal activities have accumulated, the number of studies exploring relevant regulations remains minimal. We identified only two studies that explicitly discussed regulatory issues. In the study by Bahamazava and Nanda (2022), after discovering the preferred methods of buying cryptocurrencies for money laundering, the authors cross-examined anti-money-laundering regulations in Italy and Russia to determine whether these contain provisions addressing such purchasing methods. Chousein et al. (2020) investigate how service providers of public blockchain systems communicate with their users about the influence of the EU General Data Protection Regulation (GDPR) on their services and find a shortage of communication and transparency on GDPR compliance issues.

There are two reasons for the lack of regulation-oriented text analysis studies. First, the time lag between the introduction of regulations in different jurisdictions limits the availability of data for regulatory studies. Second, analyzing the content of regulations requires a computer program that can handle legal terminology; context-specific dictionaries are therefore required to extract information correctly. Researchers also need domain knowledge to interpret the results accurately, which can be challenging in many areas. Nevertheless, because understanding regulatory frameworks is essential to combat blockchain crimes and promote blockchain adoption, more research from the perspective of blockchain-related regulation is needed.

Perception of blockchain

The perception of (potential) users is crucial for the development of emerging technologies such as blockchain. Public acceptance does not rely merely on economic benefits but also on other aspects. Studies have attempted to discover how the public perceives blockchain technology and what drives attitude formation. Such studies are closely associated with social and cultural factors and therefore sit within interdisciplinary fields, such as behavioral finance. Only seven such papers appear in this review; however, the questions they discuss are diverse.

Blockchain was initially surrounded by suspicion and considered a questionable technology; however, its acceptance has grown gradually. Users are attracted by the security, privacy, transparency, trust, and traceability offered by blockchain (Grover et al. 2019), but adoption is still hindered by a lack of blockchain knowledge and distrust of the technology (Yadav et al. 2021). Doubts can be removed by building channels through which the public gains knowledge: (1) media articles help the public obtain more information about blockchain, which encourages further exploration and acceptance of the technology; (2) existing business problems motivate experimentation with blockchain and enhance trust (Perdana et al. 2021). Cultural background also shapes the perceived value of blockchain. Grassman et al. (2021) conduct a comparative study between Sweden and Japan on attitudes toward the autonomy that cryptocurrency brings. The principle of autonomy has a higher intrinsic value in Sweden, whereas Japan takes a more pragmatic view of autonomy (i.e., as facilitating investment prospects).

Within the broader blockchain ecosystem, specific products with distinctive characteristics are viewed differently. Some studies (Caliskan 2020; Mnif et al. 2021; Bashchenko 2022) explore the perceptions of Bitcoin, Bitcoin Green, and cryptocurrency exchanges and explain the reasons behind these interpretations.

RQ3

What are the research gaps and promising future research topics?

We now summarize the research gaps described in the papers and observed by us, and develop topics that future studies could address.

Improvement of data preparation

The quality of the input data largely determines the model output; however, the complexity of text data makes preparation challenging. Many current studies merely conduct standard data preparation and ignore the features of different types of text. To prevent "garbage in, garbage out", future research can look more deeply into the characteristics of specific texts and prepare the data in a way that fits those characteristics.

Data selection After text preprocessing, the text data should be further selected or weighted according to their features. This procedure is still neglected in a substantial number of papers. For example, Twitter offers millions of short texts daily, but misinformation is omnipresent; bots and fake accounts should not be ignored and must be separated from genuine users (Burnie and Yilmaz 2019; Kraaijeveld and de Smedt 2020). Bashchenko (2022) divides news into two types: (a) endogenous news, which describes past price movements, and (b) fundamental news, which provides information with potentially greater impact. When using news for price prediction, endogenous news should be filtered out because it has limited influence on future prices.
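A rule-based bot filter of the kind implied above might look as follows; the account records and thresholds are hypothetical, and published studies typically rely on richer features or dedicated bot-detection classifiers:

```python
# Hypothetical account records; real studies use richer features and
# dedicated bot-detection classifiers.
accounts = [
    {"user": "alice", "followers": 500, "tweets_per_day": 8, "account_age_days": 900},
    {"user": "bot_x", "followers": 3, "tweets_per_day": 400, "account_age_days": 5},
    {"user": "carol", "followers": 1200, "tweets_per_day": 15, "account_age_days": 400},
]

def looks_like_bot(acct, max_rate=100, min_age=30, min_followers=10):
    """Flag a likely bot: extreme posting rate, a very new account,
    or almost no followers."""
    return (acct["tweets_per_day"] > max_rate
            or acct["account_age_days"] < min_age
            or acct["followers"] < min_followers)

humans = [a["user"] for a in accounts if not looks_like_bot(a)]
print(humans)  # ['alice', 'carol']
```

Only the texts from the retained accounts would then enter the sentiment or topic models.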

Another way to improve preparation can be achieved by setting relevance levels for the texts. Twitter accounts can be weighted according to their influence levels (e.g., number of followers, retweets, and user networks) (Jain et al. 2018; Li et al. 2019), and the influence of a patent is reflected by the number of citations.
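Follower-based weighting can be sketched as follows; the tweet records are invented, and the example shows how a single influencer account can dominate the aggregate sentiment signal:

```python
# Hypothetical tweets with sentiment scores and follower counts
tweets = [
    {"sentiment": 0.8, "followers": 2_000_000},  # influencer account
    {"sentiment": -0.5, "followers": 150},
    {"sentiment": -0.4, "followers": 80},
]

def weighted_sentiment(tweets):
    """Follower-weighted mean sentiment."""
    total = sum(t["followers"] for t in tweets)
    return sum(t["sentiment"] * t["followers"] for t in tweets) / total

# The unweighted mean treats all accounts equally
unweighted = sum(t["sentiment"] for t in tweets) / len(tweets)
print(round(unweighted, 2), round(weighted_sentiment(tweets), 2))
```

Here the unweighted mean is slightly negative, while the follower-weighted mean is strongly positive; which variant is appropriate depends on whether reach or prevalence of opinion is the quantity of interest.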

Dictionary building Dictionaries are essential for text analysis models (e.g., of sentiments and topics). However, they are generally applicable only to a specific context, as vocabularies can change their meaning across disciplines (Loughran and McDonald 2011). Applying an off-the-shelf dictionary built for another domain can be problematic for blockchain studies, as new vocabulary and jargon have been invented in blockchain. Studies have indicated that designing a domain-specific lexicon for blockchain could improve the accuracy of analysis (Balfagih and Keselj 2019; Chen et al. 2019a; Sattarov et al. 2020). Existing studies primarily adopt the VADER (Hutto and Gilbert 2014) and LM lexicons (Loughran and McDonald 2011), and only a few have developed or integrated blockchain-specific lexicons (Chen et al. 2019a; Barth et al. 2020; Kraaijeveld and de Smedt 2020; Huang et al. 2021).
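Extending a general-purpose dictionary with domain terms can be as simple as merging score tables; the base scores and crypto-slang scores below are invented for illustration, not taken from VADER or LM:

```python
# A tiny general-purpose lexicon (scores are invented), extended with
# blockchain jargon whose sentiment a generic dictionary would miss.
BASE_LEXICON = {"good": 1.0, "bad": -1.0, "great": 1.5}
CRYPTO_TERMS = {"hodl": 1.0, "rekt": -1.5, "fud": -1.0,
                "mooning": 1.5, "rug": -2.0}

lexicon = {**BASE_LEXICON, **CRYPTO_TERMS}  # domain terms override/extend

def score(text, lexicon):
    """Sum of lexicon scores over the tokens in a text."""
    return sum(lexicon.get(t, 0.0) for t in text.lower().split())

post = "Hodl through the FUD this coin is mooning"
print(score(post, BASE_LEXICON), score(post, lexicon))  # 0.0 1.5
```

The generic lexicon scores the post as neutral because it recognizes none of the jargon; the extended lexicon recovers the clearly bullish tone.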

Extension to underused data and growing areas

In this review, we find a concentration of text data drawn from social media, online forums, and academic papers. At the same time, many other documents containing valuable information remain underused. Corporate-generated documents (e.g., SEC and patent filings) are not frequently utilized despite their importance in revealing corporate-level information. For instance, in finance studies, patent filings are used to identify specific FinTech categories (Chen et al. 2019b, 2022). Studies use 10-Ks for different purposes: product description sections to construct industry sets based on product similarity (Hoberg and Phillips 2016), business descriptions to measure a company's asset specificity (Chen et al. 2022), and risk disclosures for risk detection (Bao and Datta 2014; Hanley and Hoberg 2019). Corporate disclosures are versatile and cater to multiple research purposes. One limitation is that blockchain startups have limited mandatory disclosures. Nevertheless, future research can make greater use of such documents to gain insights into the blockchain adoption strategies of established companies.

Another gap in the review is the absence of papers related to the keywords NFT, STO, IEO, and stablecoin. These are relatively new concepts in blockchain and are largely understudied; researchers investigating these areas will contribute to a better understanding of their market mechanisms. For example, potential text data for NFTs include descriptions of NFT items and social media discussions about them. STOs are treated as traditional securities and carry all associated rights and obligations, including approved prospectuses for public offerings, while IEO project whitepapers are thoroughly vetted by the exchange prior to launch; the resulting documents are therefore more standardized and can be used in much the same way as standard corporate disclosures. Stablecoins are connected to conventional financial systems and have drawn attention to financial stability issues; news (integrated with event studies) could provide coverage from this perspective.

Regulation

Given the increasing trend of cryptocurrency in the monetary system, government policies and regulations are essential for counteracting risks, restricting illicit activities, and protecting consumers (Chokor and Alfieri 2021).

Many jurisdictions have updated or supplemented their regulatory frameworks to accommodate the existence of cryptocurrencies and other blockchain-based decentralized applications (e.g., Market in Crypto-Assets (MiCA) and Framework for International Engagement on Digital Assets). Issues such as money laundering, terrorist financing, and tax evasion have been extensively recognized and addressed. In addition, organizations such as the International Organization for Standardization (ISO) and the Financial Stability Board (FSB) are working to establish international rules and standards to promote collaboration among jurisdictions. Many proposed frameworks are still in their initial stages or awaiting implementation, and updates can be expected.

Texts used in regulation-related research are not limited to regulatory documents; they also include texts such as corporate disclosures related to blockchain or cryptocurrency (SEC 2022), terms-of-service agreements, and online discussions of regulatory terms. Future research could integrate regulatory factors into the study design, examine the impact of regulations on markets in different jurisdictions (Barth et al. 2020), and observe users' perceptions of and reactions to specific regulations. This could provide insightful implications for practitioners and policymakers regarding the implementation of relevant regulations and how those subject to them will comply.

Conclusion

The uncomplicated access and rich information in blockchain-related texts make them ideal for complementing numerical data in research. However, a comprehensive review of this topic to provide guidance for researchers is lacking.

This study addresses this issue by making several contributions to the literature. First, we provide comprehensive summaries of the research scope, text data sources, and text analysis methodologies in the existing literature to guide researchers toward pertinent resources. Second, we go beyond individual elements and exhibit the connections between them. We combine the above elements and present the two most frequently used combinations: (1) papers focusing on cryptocurrencies conduct sentiment analysis on instant user-generated content or news articles to find correlations between sentiment and market behavior, and (2) papers examining the concept of blockchain apply topic modeling to formal documents to discover classifications and trends. We emphasize that it is crucial to choose appropriate combinations considering multiple perspectives, such as data characteristics and research questions. Finally, we integrate blockchain-related research areas and text analysis approaches into a joint framework. By not restricting our search to one discipline, we are able to capture the use of text analysis in non-technical blockchain studies across disciplines and provide multiple perspectives on the topic. We highlight five major research topics discussed in the literature: relationship discovery, cryptocurrency performance prediction, classification and trend, crime and regulation, and the perception of blockchain. Furthermore, by referring to individual papers and aggregated information, we identify three future research directions: improvement of data preparation, studies using underused data and growing areas, and regulation-related research.

We are aware that this review shares the publication bias common to literature reviews: studies with statistically significant results are more likely to be published (Rosenthal 1979). To alleviate the impact of this bias, we searched the most comprehensive databases for peer-reviewed papers and chapters and also included unpublished working papers on SSRN in the keyword searches. Backward snowballing was conducted on the included papers to identify further papers that did not appear in the keyword searches. We believe that, through these multiple procedures for identifying targeted papers, we obtained a comprehensive collection for this literature review.

Despite this limitation, this study provides a timely, academic-oriented review of the text analysis approaches used in blockchain research. Our detailed summaries will help researchers navigate specific text data types and methodologies. Our findings on the current research landscape and the suggested future directions can facilitate the selection of promising research topics and the implementation of suitable methodologies. Overall, this review will be useful for researchers from various disciplines interested in exploring large-scale text data in blockchain-related research.