1 Introduction

Trade is a fundamental economic act that involves buying, selling or exchange commodities and services between participants that can be individuals, business entities or national organizations. Foreign trade occurs when these participants belong to different countries with different trade policies (Lawrence et al., 2018). In this context, foreign trade can be negatively impacted by various trade-barriers. The latter refers to the set of government-induced policies that put restrictions on international trade (Lawrence et al., 2018) by making the trade either more difficult or expensive through some tariff or regulation like barriers, or even by fully forbidding the entry of some specific goods / services as in case of trade embargo or economical sanctions. Motivations for introducing trade barriers are rooted back to government need to protect infant industries from international competitions and promote local business actors to gain competitiveness or maintain high employment rate. Nevertheless, in the age of internationalization and globalization, economists commonly admit that trade barriers are against free-trade policy advocated by the World Trade Organization (WTO) and undermine investors as well as small- and medium-sized enterprises (SMEs) who seek long term stability, predictability and economic profitability in their international markets. Although, as a result of the General Agreement on Tariffs and Trade (GATT), tariff-barrier is left at quite low levels, the total trade costs remain relatively high (Anderson, 2004). This is due to the growing impact of non-tariff barriers, which, include, among others, product regulation, border inspections, transportation, access to efficient distribution channels, time delay, cultural and language clues, among others. Therefore, timely and efficient identification of such trade-barriers is of paramount importance for both business entities and regulator agencies. Indeed, as pointed out in (Bahja, 2020), this contributes to reinforce E-Governance duties in terms of promoting transparency in international business and supporting local business operators in easing access to international market opportunities (Ghosh, 2009). Moreover, shedding light on trade barriers provides valuable input data to the World Trade Organization (WTO) (Global trade, 2021) that will foster the development of more evidence-based policing. Strictly speaking, national agencies and European Union (EU) institutions maintain a database of trade-barriers that can be made available for their export enterprises. This includes, for instance, the Market Access Database and Access to Market Portals (European Commission. Market Access Database. 2021) in the EU, and the monthly reports of The Office of the United States Trade Representative (USTR) in USA (The Office of the U.S. Trade Representative (USTR). FOREIGN TRADE BARRIERS., 2021). However such databases often contain a high-level description of the trade-barrier according to the agency own categorization when appropriately reported by business actors and lack a fine-grained description that might be of a direct interest to a random investor or SME. Therefore, identifying efficient approaches for uncovering such barriers remain a constant priority for both trade agencies and business actors. This motivates the current paper, which advocates a novel approach based on natural language processing (NLP) and deep learning technology to mine the cross-border trade-barriers between Finland and Russia in fish industry. In essence, this study uses commerce discussions (feedback, questions, and complaints) extracted from Finnish and Russian specialized business discussion forums to uncover trade barriers faced by Finnish and Russian entities (individuals, companies, and government agencies) in fish industry sector. As discussed in (Bahja, 2020), a such study contributes to reinforce E-Governance duties in terms of promoting transparency in international business and supporting local business operators in easing access to international market opportunities. Moreover, shedding light on trade barriers provides valuable input data to the World Trade Organization (WTO) in a way to foster the development of more evidence-based policing.

Overall, this research aims to contribute to the task of automatic identification of trade-barriers from unstructured textual sources. We confine our trade-barrier analysis to the trade-barrier ontology provided by Onkivisit & Shaw (Onkvist & Shaw, 2009), which advocates 11 classes of Tariff and Non-Tariff barriers. We shall refer to this classification as OS-K trade-barrier ontology. More specifically, this research addresses the following three research questions:

  • RQ1: How to automatically identify the OS-K trade-barriers from unstructured textual inputs?

  • RQ2: How to automatically identify the discourse and causes related to each trade-barrier type?

  • RQ3: How to evaluate the results generated by a such automatic trade-barrier identification system?

We restricted our study to the Finnish-Russian trade-barriers in fish industry according to dataset provided as part of an European CBC Karelia research project. This paper advocates five-fold contributions.

  • A comprehensive review of NLP applications in business sector has been conducted, highlighting the key pitfalls in this area to address the problem of automatically identifying trade-barriers from textual sources. This is carried out in the background section of this paper.

  • A three-leg ontology, consisting of barrier-business-trade ontology, for querying Finnish and Russian forum dataset has been put forward. This enables us to match user textual inputs to OS-K trade-barrier ontology.

  • A new architecture that uses state-of-the-art BERT model and reverse engineering clustering model has been devised to identify the various trade-barrier types as well as their associated causes and discourse in Methodology section. This answers research questions RQ1 and RQ2.

  • Exploratory and visualization-based approaches have been conducted to interact with the user and contribute to the evaluation of the findings, which answers research question RQ3.

  • A comparison of the findings with previously identified Finnish-Russian trade barrier report further sheds light on the statistical assessment of the model and, thereby, contributes to RQ3 as well.

Section 2 of this paper presents the background information in terms of use of NLP technology in business sector for the purpose of trade barrier identification. It also provides a brief outlook of the adopted OS-K ontology. Section 3 describes the dataset employed in this study. Section 4 details the method and the data pipeline employed to answer the three aforementioned research questions. Results and discussions are reported in Section 5. Trend comparison with economic statistics is carried out in Section 6. Finally, research implications and concluding statements and perspective works are reported in Section 7 and Section 8, respectively.

2 Background

It should be noted that surprisingly the problem of automatic identification of trade-barriers from textual sources is much overlooked in both business and information processing research communities where no well cited reference can be identified as far as our investigation is concerned. This is probably due to inherent challenges of NLP tasks involved into this process as well as the scarcity of data and multidisciplinary aspects. Therefore, we first explored the state-of-the-art of NLP research in business sector with the objective of identifying related research that can contribute to our problem. Then we explicitly provided a background of the OS-K trade barrier ontology that we adopted in this paper.

2.1 Natural language processing in business sector

Natural language processing is becoming the new standard for business intelligence where self-service analytics service providers, e.g., IBM Watson, are increasingly adding NLP uses to their platforms to enhance search experience and go beyond string matching. NLP also helped developing enhanced personalized services to account for individual user’s profiles generated through continuous monitoring of his purchases and actions as well as capitalizing on user’s review and feedback, taking into account the inferred sentiment and rationality as in (Bounab et al., 2020), (Eisenstein, 2019), (Tarnowska & Ras, 2019), (Tarnowska & Ras, 2019), (Tarnowska & Ras, 2021) (Krouka et al., 2021). Besides, with the developments in e-commerce and finance applications, NLP research contributed further in the emergence of sophisticated decision-support and recommendation systems that efficiently interact with customers and provided relevant recommendations to the user (Tarnowska & Ras, 2021). A simple observation of the functioning of the very popular Amazon and Netflix recommendation systems that build on user’s profiling, which explores user’s past purchases, navigation history, product contents, rating and recent news update, reveals the success of such applications in our daily-life. Chatbots or virtual assistants become part of business ecosystem in many large scale or even SMEs where the interaction with customers is part of their core business. In this context, chatbot is found to substantially increase company initiated communications, mine customers’ needs and interests, raise awareness, guide users to other selected service providers and achieve high level of cost effectiveness (Luo et al., 2013). For instance, Public Tableau Ltd. suggests that 30% of customer service positions in the US can be automated using chatbots, resulting in $ 30 billions annual saving.

Data collection of either structural or unstructured textual inputs is another business application of NLP. It can constitute the core business of the whole company as in AlphaSense (Technologies, 2021). Without claiming an exhaustive coverage of the NLP applications in the business field, a summary of the dominant practices in this area is presented in Table 1. In essence our survey of the existing literature in the field of NLP-based business revealed the following (see Table 1). First, there is a lack of comprehensive and up-to-date survey papers in the field within information processing community as testified by the identified literature, where several key publications were rather published in sparsely forums and communication channels. Second, the dominance of sentiment analysis literature for exploratory-analysis, prediction or even business strategy is surprisingly very striking. From this perspective, identifying trade-barrier can be initiated by further exploring posts with negative polarity, or tracking special vocabulary that translate our interpretation of trade-barrier. Third, there is a recognition that chatbots and conversational assistants will become state-of-the art in the next decade, acting as a cross-edge where many AI technologies ranging from speech recognition to business analytics passing through various components of NLP, deep learning technology, knowledge discovery, dialogue constructs and emotional monitoring can be applied to infer and/or maintain high focused conversations with potential customers. From this perspective, chatbot data can also be utilized to infer insights regarding potential trade-barrier by monitoring utterance conveying user’s discomfort and feedback. Fourth, many business related domains are unfortunately left unexplored. Among these domains, one shall mention the automatic identification of trade-barriers for business entities, which motivates our current work in this paper. There are increasing theoretical frameworks from AI, modal reasoning, approximate reasoning, causal reasoning (Morgan & Winship, 2007) that might be utilized for trade barrier mining. Nevertheless, their success in practical NLP applications is rather limited and further research still needed to narrow the gap between theory and practice. Sixth, the challenges in applying NLP technology to business sector cannot be ignored. Indeed, this is rooted back to the challenges of text analysis where inherent ambiguity, dynamic change of contextual discourse, presence of noisy information, use of metaphors and other complex lexical constructs render the application of standard NLP pipeline open to debate (Bahja, 2020; Lagi, 2019; Mitchell et al., 2014; Techlabs, 2018). This calls for subsequent research and exploration.

Table 1 Applications of NLP in business

2.2 Trade barriers classification

Several trade barriers classifications have been introduced by several scholars and regulators. Grübler and Reiter (Grübler & Reiter, 2021) in their classification emphasized three different forms of trade barriers: Tariff, Non-Tariff and Free-trade. In the same spirit as World Trade Organization (WTO) categorization, Onkivisit & Shaw (Onkvist & Shaw, 2009) classified trade barriers into two main classes: Natural trade barriers that deal with non-human causes such as cultural or language cues, and Artificial barriers that encompass Tariff and Non-Tariff barriers. Tariffs barriers involve a financial burden on imported commodities and include six different attributes:

  • Direction (1): focuses on the discrepancy between import and export movement of goods/services.

  • Purpose (2): refers to the reason for introducing Tariff extra duties. This is usually related to compensation for local industry or additional tax revenues to protect domestic economy from foreign competitors.

  • Time length (3): distinguishes Tariff surcharges which are temporary and limited in time, and those which are permanent.

  • import restraints (4): correspond to special and variable duties associated with frequent changes in world market prices.

  • Tariff rates (5): correspond to duties that vary per unit of weight, volume, gauge or other indicator of quantity on the good crossing the border. This includes three tax rates: specific, ad valorem and combined.

  • Distribution points (6): correspond to indirect taxes collected at a critical point of distribution or while a purchase or a consumption occurs.

Non-tariff barriers force other overt and clandestine means to restrict imports and occasionally exports. It includes five different attributes:

  • Government participation in trade (7): includes administrative guidance to restrict foreign purchases to a certain percentage of local demand, government procurement controlling commercial operations and subsidies involving governmental support domestic producers(Onkvist & Shaw, 2009).

  • Customs and entry procedures (8): consist of a set of enforcing procedures that add extra difficulty and time to import the goods. This includes preliminary valuation, documentation, license, inspection, and Health and safety regulations.

  • Product requirements (9): include product standards related to physical characteristics and/or product performance, product testing and packaging, labelling, and national safety regulations prior to allowing the product to market.

  • Financial control (10): involves monetary policies to control capital flow to protect local currency. It includes exchange control, multiple exchange rates, prior import deposits and credit restrictions, and profit remittance restrictions.

  • Quotas (11): refer to quantity control on imported and exported goods (Onkvist & Shaw, 2009; Schaffer et al., 2014).

Figure 1 shows the above trade-barrier classification, referred OS-K trade barrier ontology. This classification is adopted in our study because of its ease, clarity and also due to the fact that it agrees to a large extent to the classification employed by the Finnish Ministry of Foreign Trade Footnote 1

Fig. 1
figure 1

Onkivisit & Shaw and Kananen (OS-K) Trade Barrier Classification

As alternative to the aforementioned classification, one shall mention the classification introduced by USTR, which is yearly updated utilizing inputs from US foreign embassies (The Office of the U.S. Trade Representative (USTR). FOREIGN TRADE BARRIERS., 2021), the European Market Access Database (MADB) (European Commission. Market Access Database. 2021), which provides valuable insights to European exporting SMEs.

3 DataSet

In this study, as part of the European Finnish-Russian project, operated by CBC Karelia organization, which aims to foster joint Finnish-Russian SMEs in the fish industry sector, two datasets have been utilized (from Finnish and Russian side). The datasets reflect the wide community interest, including professionals, users and SMEs in fish industry ecosystem (i.e, fish marketing, transportation, production and consumption of fish products) in both Finland and Republic of Karelia. The overall database has been shared under the data-sharing and usage privacy agreement. The database has been elaborated in the first quarter of 2021 and contains historical data as well.

3.1 The finnish dataset

We used the Suomi24 forum (www.suomi24.fi) because of its popularity and its wide spread coverage including business related discussion. We selected the Seafood section of Suomi24 as it was found to contain interesting fish business related to discussions. We crawled all data linked to seafood thread, which span from 2021 till 2014. This dataset is constituted of 14701 text entries between topic titles and comments written by users in different languages such as Finnish, Swedish, Hebrew, Spanish, and some English. Table 2.(a) shows some statistics related to the Finnish dataset.

Table 2 (a) Finnish Dataset Statistics and (b) Russian Fish farming dataset Statistics

3.2 The russian dataset

The dataset is generated by crawling 13 different Russian fish farming business forums. The detail of the dataset is summarized in Table 2. The collected dataset contains around 1000 messages from the 375 users registered to the 13 Russian fish-farming online discussion forums, covering 114 topics in overall. The dataset includes time attribute, indicating the date where the post has been posted (from 2010 till 2019). Table 2.(b) shows the list of websites used for building the Russian dataset and their statistics.

4 Methodology

4.1 Rationality and overall architecture

Given the nature of the input data, a combination of NLP, deep-learning technology, and clustering approach has been adopted to answer research question RQ1 and RQ2. Prior to detailing the technical architecture used to handle the above research questions, a set of rational assumptions have been adopted for this purpose.

Assumption 1

Trade-barriers are rather associated with negative polarity posts.

The preceding draws on the intuition that when users discuss about the trade-barriers, they often convey negative sentiment posts. Therefore, a sentiment analysis approach can be used as a preliminary exploration to identify potential trade-barriers from the discussion forums.

Assumption 2

The association of a single post to trade-barrier types is not mutually exclusive.

This draws on the inherent ambiguity pervading natural language so that it is not fully excluded that a given post will be associated to more than one trade-barrier class among the 11 classes of the (OS-K) ontology.

Assumption 3

The causes of each trade-barrier type can be revealed from the analysis of the discourse built from the constructed corpus associated to this trade-barrier.

The preceding confines the causality analysis, whose scope often goes beyond to linguistic and semantic aspect to complex philosophical issues, to the output of discourse-based analysis. This motivates a clustering based approach followed by a manual checking step for this purpose. Figure 2 provides a generic pipeline for identifying the various trade-barriers and their causes. The approach highlights at least four distinct phases: Preprocessing, Data collection, Trade-barrier mapping, Causes identification.

Fig. 2
figure 2

Generique pipeline

4.2 Preprocessing

Initially, since the most powerful tools used in NLP are in English, all the posts were translated into English language using Google Translate (GT) application programming interface (API). Although, we acknowledge that such a translation may overlook important messages conveyed by Russian blogs, our initial exploration revealed that this loss is rather marginal and still all named-entities and verbal statements were unaffected by this translation. This also testifies of the efficiency of Google translation, which is widely adopted in professional forums. Strictly speaking, other empirical studies showed that translation API often fail in case of metaphors (Solutions, 2017), culinary texts (Hasyim et al., 2021), or when attempting to capture writing style and demographic attribute. Nevertheless, these aspects are less apparent in short text messages of our database, which partly explain the efficiency of translation API in our case.

Given the nature of unstructured text collected from the Finnish and Russian forums, which contain several uncommon characters and noisy terms that can deviate the outcomes of the parsers and other NLP modules, preprocessing stage is required. This goes through successive steps to clean out all the noise in the retrieved textual data. Especially, this includes the following:

  • Remove URLs and emails;

  • Replace combined tokens by separate ones, e.g., “hasn’t” becomes “has not”;

  • Remove distracting single quotes;

  • Remove punctuation and Numbers;

  • Remove extra spaces.

4.3 Ontology for data collection

From the raw dataset (Finnish and Russian), there is a need to identify useful subset of data that is relevant to our purpose of mining trade-barrier evidence only. For this purpose, we have created a three-leg ontology to define the keywords needed for querying the database.

  • Barrier ontology: represents keywords that define the concept of “Barrier”.

  • Business ontology: represents keywords that define the concept of “Business”.

  • Trade ontology: represents keywords that define the concepts of “Import” and “Export”.

To build the keyword list associated with each ontology-leg, we follow a three-stage process. Initially, a manual crafting of wording associated with each ontology-leg are drafted. This is followed by an automatic expansion using WordNet synonymy relation and then a purification and manual checking stage is carried out for potential inconsistency detection. Table 3 summarizes each component of this leg-ontology construction and its associated keywords used for querying the database.

Table 3 Ontologies used for search

We have devised a three-level query system to search for all possible matches within the Russian/Finnish database.

  1. 1.

    The shallow search: performs the keyword search operation within the titles (threads’ names of the discussion) only of the documents.

  2. 2.

    The vast search: extends the keyword search operation to both title and main text body of the topic.

  3. 3.

    The deep search: performs the keyword search operation on title, main text body, and replies/answers of each topic.

Algorithm 1 describes the different steps used for querying the retrieved datasets utilizing the predefined ontology-legs. In practice, we have only used the “vast” and “deep-search” modes with the three-leg ontology mentioned above, where 84 keywords have been employed. This generated 14701 output files for the Russian dataset and 3636 output files for the Finnish dataset, which are then used as input to our reasoning pipeline detailed in the next section. We shall denote by ω the set of posts resulting from the application of this search operation.

figure a

4.4 Trade-barrier identification

As pointed out in the previous section, the newly extracted subset of dataset, which best accommodates the three-legs ontology (Barrier, Business and Import & Export) that encapsulates the concept of cross-border trade-barrier is now utilized for subsequent reasoning. For this purpose, in light of Assumption 1 and Assumption 2 endorsed in this study, the task of identification trade-barrier type undergoes three distinct steps:

4.4.1 Sentiment analysis

Given the assumption that barriers are negatively perceived by the users, sentiment analysis has been used to identify negative posts from the subset dataset Ω generated by the application of the above three-leg ontology. We used SentiStrength package (Thelwall et al., 2010), which is found to have human-level accuracy for short social web text, to identify sentiment polarity of each post. It reports two sentiment strengths E+ for negative sentiment and E+ for positive sentiment in the scale (-4 to + 4). Therefore, each ith post pi of ζ such as E+(pi) +E(pi) \(\leqslant \) -1 is considered to bear a negative polarity, and will thereby be included in the newly created subset, denoted ζ, where ζ \(\subseteq \) Ω. In other words, ζ contains only posts that more likely include discussions directly linked to trade-barriers. The next step concerns the identification of the type of the trade barrier associated to each post in ζ.

4.4.2 Trade-barrier type matching process

To uncover the type of trade-barrier associated to a given post, a deep learning-based approach has been designed. This consists of encoding the definition of each of the 11 trade-barrier types as well as each post in ζ using the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) architecture (Devlin et al., 2018) released by the Google research group in 2018. Figure 3 presents a high level graphical illustration of this process where pre-trained BERT model is employed.

Fig. 3
figure 3

Trade barriers embedding

We mainly use BERT as an encoder for text representation, which yields sentence vector representation for each trade-barrier type definition, say, Hi, i = 1 to c, and each post pi, i = 1 to M, where c and M stand for the total number trade-barriers types (which is 11 as per the OS-K employed ontology) and the number of posts in ζ, respectively. More specifically, we have:

$$ {H^{s}_{i}} = BERT(H_{i}), i = 1 to c $$
(1)
$$ {p^{s}_{j}} = BERT(p_{j}), j = 1 to M $$
(2)

where \({H^{s}_{i}}\) and \({p^{s}_{j}}\) are the vector representation of the ith trade-barrier type and jth post in ζ, respectively, corresponding to their associated BERT embedding. The use of BERT embedding instead of Glove (Pennington et al., 2014) or word2vec embeddings (Mikolov et al., 2013) is motivated by several grounds. First, the BERT model builds on the state-of-the-art Transformer architecture in NLP, trained over 2.5 billions of words and uses bi-directional learning to gain context of words from both left to right context and right to left context simultaneously, which offers a substantial ability to account for the context and various linguistic constructs of the input word sequence (Vaswani et al., 2017; Renjith et al., 2020). This intuitively offers an edge over any semantic similarity like analysis. Second, the configuration of BERT architecture enables relatively large word-sequence to be treated as a single input, which captures all the information conveyed by this ordering word-sequence. This contrasts with word2vec or Glove embedding whose model assigns a separate embedding vector to each individual word of the sequence, which intuitively leaves the ordering information of the sequence fully ignored. Third, the model offers reasonable flexibility in fine-tuning the model hyper-parameters without important complexity burden. Now given that both the trade-barrier type Hi and post pj are represented in real-valued vector representation of same size, the matching between the two entities can be assessed using standard techniques well established within the scope of information retrieval (IR) field (Salton & Harman, 2003) which covers, among others, the process of matching between a given query and a set of documents. In this respect, in line with vector-based information retrieval methods, we use a simple inner product or a cosine similarity to assess the similarity between Hi and post pj. More specifically, the similarity between Hi and pj is calculated as:

$$ Sim(H_{i}, p_{j})=Cosine ({H^{s}_{i}}, {p^{s}_{j}})=\left (\frac{{H^{s}_{i}} \bullet {p^{s}_{j}}}{\left\|{H^{s}_{i}}\right\|\left\|{p^{s}_{j}} \right\|} \right), i = 1\ to\ c; j = 1\ to\ M $$
(3)

The application of Eq. 3 over all posts and trade-barrier types yields a similarity matrix M with size c x M and whose values lie in the unit interval.

Next, in order to find out whether a given post can be assigned to a given trade-barrier type, often thresholding-like approach is employed. For this purpose, we adopt a two-stage strategy.

  • Acceptance threshold It corresponds to the minimum similarity value below which the agreement between a given post and a given barrier type is deemed void. In other words, if there is no posts p such that Sim(p, Hk) is beyond the acceptance-threshold, then the barrier type Hk is considered not reflected in the existing pool of posts. We empirically set the value of acceptance-threshold to 0.5.

  • High-Frequency-based threshold After the initial screening of the similarity values, which discards those posts whose similarity value is below the acceptance-threshold, this phase further scrutinizes the restricted range of permissible similarity values of each trade-barrier type to select a final threshold beyond which the post is deemed to be assigned to the underlining barrier class. For this purpose, we proceed in two phases. First, for each column of the similarity matrix M, corresponding to a given trade-barrier type, we create a subdivision of the minimum and maximum values of the similarity values. Second, the bin value, which contains the largest number of items and relatively high similarity value is selected as a final threshold 𝜖. The rational behind doing so is driven by our desire to ensure some flexibility in terms of post-barrier type association in order to ensure the existing of a sufficient number of posts in each trade-barrier type. This contrasts with standard approach often adopted in information retrieval literature where the maximum similarity value is often adopted as a criterion for the association. On the other hand, this is also motivated by the subsequent task of the cause investigation where one desires to have a corpus associated to each trade-barrier type for both Finnish and Russian dataset. More formally, the corpus \(\mho _{i}\) of each trade barrier type Hi is defined as:

    $$ \mho_{i} = \left \{ p_{j}: Sim\left ({p^{s}_{j}}, {H^{s}_{i}}\right) \geqslant \epsilon \right \} i = 1 to c $$
    (4)

For a detailed implementation and the employed trade-barrier definitions, one shall refer the reader to the Github repository of the project.Footnote 2

4.5 Trade-barrier cause identification

This subsection aims to comprehend the reasons behind each trade barrier type in both Finnish and Russian corpora. For this purpose, a causality analysis has been carried out. Although, the field of causality detection from textual is an active research field in both computational linguistics and computer science communities where several linguistic rule-based, supervised and unsupervised machine learning methods as part of competitions and challenges, e.g., SemEval2007, SemEval2011 (Keith et al., 2020). Nevertheless, applying most state-of-the-art approaches to our case is rather challenging for at least two reasons. First, the quasi-majority of the state-of-the-art machine-learning approaches are domain-specific. Therefore, attempting to apply them to our dataset will be intuitive of little relevance. Second, the quasi-majority of causality literature instead focused on identifying the cause-effect relationship. We are only interested in comprehending the causality aspect without paying importance to the "effect" parameter. We, therefore, designed a semi- supervised approach to handle the causality aspect. This consists of at least five phases:

  • For both Finnish and Russian, we take the set of posts associated to each barrier type as an input; namely, \(\mho _{i}\), for i = 1 to c.

  • We use the list of tokens formed by all posts in \(\mho _{i}\), i = 1 to c, to construct a TF-IDF vector representation of each post pj in \(\mho _{i}\).

  • We perform a k-means clustering (Rong & Liu, 2020) of the posts pj in \(\mho _{i}\) using their associated TF-IDF vector representation. Then, we report the centroids Lk of each cluster.

  • We perform the reverse-engineering operation from TF-IDF to textual representation to transform each centroid Lk into its textual representation, say,Tk.

  • We hypothesize that the content of the textual centroids bears to a large extent the required information to comprehend the cause of the barrier type occurrence.

  • We interact with the operator that reads the content of the centroid for each barrier type for subsequent reflection, if any.

The preceding allows us to extract more detailed reflections about the trade barriers’ content, revealing and summarising the causes of the presence of the detected trade barriers. Figure 4 summarizes the five steps used for Trade Barrier Causes Identification.

Fig. 4
figure 4

Text clustering

4.5.1 Evaluation

In the absence of ground truth information that would allow us to perform more quantitative analysis of the usual information-retrieval adopted performance metrics, and in order to address RQ3, we mainly relied on two main approaches for this purpose:

  • We performed exploratory and visualization-based analysis as a way to seek the relevance of the outcomes.

  • We performed a brief comparison analysis with previously available reports regarding the Finnish-Russian trade barrier.

5 Results and discussions

5.1 Exploratory analysis

Initially, we performed an exploratory analysis to comprehend the Finnish and Russian overall corpora content using WordCloud visualization, which helps identify its core contents in terms of word frequency and aesthetic features, providing general insights on the most discussed topics. Figure 5 (a) and (b) show the WordCloud representation of Finnish and Russian Corpora, respectively. From Fig 5 (a), one may notice that the Finnish dataset is rather dominated by fish and related seafood (food, eat, salmon), food taste, pricing, and various fish recipies. This indicates the tendency of the users in this forum to act as fish consumers instead of international business dealers. However, discussions related to Russian imported fish and Russian fish recipes are highly present as well.

Fig. 5
figure 5

Most frequent words in (a) Finnish Suomi24 Forum and (b) Russia Forums

It is noticeable from WordCloud visualization that this is not enough to reveal insights in terms of trade-barrier identification and answer our research questions. This motivates the use of other tools such as sentiment analysis, similarity calculus and clustering analysis to help grasp other hibernated aspects in both datasets.

5.2 Sentiment analysis results

Initially, we want to quantify the extent of positive and negative sentiment in the initial Finnish and Russian dataset using SentiStrength sentiment analysis method. Table 4 summarizes the results of Finnish and Russian sentiment scores in terms of number of posts in each sentiment polarity.

Table 4 Summary of sentiment polarity analysis

It is noticeable that negativity is the most dominant sentiment polarity for the Finnish side, followed by the neutrality and positiveness polarities. In contrast, neutrality appears to be the dominant sentiment polarity for the Russian side, followed by positiveness and negativity. Since the core methodology of our model builds on the handling of posts with negative polarity, we can therefore notice that the amount of such datasets is subsequently higher in the case of the Finnish dataset compared to the Russian dataset.

5.3 Trade-barrier identification results

Critical in implementing the trade-barrier type identification is determining the high-frequency-based threshold 𝜖 in expression (4). In order to highlight the convenience of our choice, Fig. 6 shows the result of assigning individual (negative) posts to each tariff and non-tariff barrier type for the Finnish and the Russian dataset, respectively. In the chart related to the Finnish dataset, there is a high discrepancy in values of tariff barriers compared to non-tariff ones. In contrast, this trend is no longer valid for the Russian dataset.

Fig. 6
figure 6

Detected trade barriers in the Finnish dataset

For the tariff barriers, it is notable that most Finnish companies suffer from the length of time their exported commodities last before they reach the final consumer with 14.74%. On the other hand, one can see a significant presence of barriers linked to rates, imports restraints, and distribution points with an average of 35%. This can be explained by the unusual increase of direct taxes imposed on the imported/exported commodities that do not follow the changes in world market prices and fair competition. Moreover, it can also be related to the indirect taxes collected at different points of distribution. Furthermore, Fig. 7 shows a sallow presence of direction-barrier with 5.49%, possibly due to WTO constraints and lack of active professional import/export in the forum.

Fig. 7
figure 7

(a) Most frequent words in Finnish trade barriers causes, (b) Most frequent words in Russian trade barriers causes

Furthermore, the proportions of the non-tariff barriers are close to each other, with a level of 10.12% for government participation in a trade associated with administrative guidance, government procurement and state trading, and Subsidies, besides the financial control with 8.67%, which may include monetary policies to control capital flow. Customs and entry procedures with product requirements also showed a high occurrence of 8.09%, which involve eligibility in terms of paperwork and worthiness of the product. Overall, the “time length” is the most frequent one in tariff barriers, while “the government participation in trade” is the most frequent non-tariff barrier.

On the other hand, Fig. 7 shows that in the tariff part, most of the Russian companies suffer from the direction tariff of the imported or exported products with 13.62%, followed by time-length and rate-change barrier with 9.32% and 10.12%, respectively. On the non-tariff part, it is clear that product requirements are the most critical barrier with 13.62%, followed by government participation in a trade with 10.75% and customs and entry procedures 9.68%. Finally, the quotas with 6.45% and the financial control with 5.02% were the lowest present barriers. Overall, the direction barrier is the most frequent one in tariff barriers, while the product requirement is the dominant one in the non-tariff barriers.

5.4 Causes identification result

We applied the K-means clustering algorithm on each encoded corpus associated with each barrier type into five clusters K = 5 with seven keywords only. Since the preliminary exploratory analysis of the influence of individual K-values, while considering the interpretation aspect, a unique centroid (K = 1) will fail to grasp the diversity aspect that may involve any rational explanation of the cause. Unlike a large K-value renders the tasks of finding appropriate interpretation rather tricky. In essence, we extracted the centers of each cluster after reverse engineering from TF-IDF representation to token-level representation while removing unnecessary words to understand the scope of the remaining words used for each trade barrier type. Finally, we aggregate the remaining words to build a bag of words associated with each barrier type to capture its associated discourse as a way to reveal the causes of such a barrier. In this regard, the result shows the following:

  • Direction barrier: seem to be caused by importing Norwegian salmon, bream and frozen Russian tuna in Christmas season with different prices (bream, frozen, tuna, Russian, salmon, Norwegian, price, different).

  • Distribution points: seem to be caused by imposing different prices on salmon, Norwegian fish pike and Russian tuna at different import lines (Russian, Norwegian, fish pike, salmon, price, tuna, report, public).

  • Import restraints: seem to caused by imposing different prices on the Russian’s imported fish products which reflect on the seafood prices (salmon, steaks, tuna, frozen, bream, herring, smoke, important products).

  • Purpose: seems to be linked to the discrepancy in fish product selling points and imposing tariff rates at different points (spice, salmon, price, pay, frozen, herring, rainbow trout, tuna).

  • Tariff Rates: caused by imposing different tariffs leading to different prices of both fresh and frozen Norwegian fish in different seasons, particularly at Christmas (frozen, salmon, prices, Norwegian, shrimp, trout, Christmas).

  • Time length: seems linked to difficulty to accommodate extended period of Christmas discount on Russian and Norwegian fish products where this is no a such extended discount on the imported products.

  • Customs and entry procedures: linked to the understanding of the working procedures of buying Russian fresh and frozen tuna and time requirement.

  • Financial control: discusses about the vendace through applying different prices for buying Russian fresh and frozen fish products (salmon, price, fish, rainbow, products, Russians, frozen).

  • Government participation in trade: discusses the official trade of fish policies (offered, common, official, trade, species, salmon).

  • Product requirements: discusses the quality standards required for trading fish products such as frozen or fresh or ever half frozen (Russian, fresh, half frozen, frozen, products).

  • Quotas: linked to the limited quantity of rainbow trout and salmon sold in Finland (tuna, fish, sold, year, Finland).

Similarly, interpreting the Russian barriers causes, one may notice the following:

  • Direction barrier: seems to be caused by the tariff discrepancy of the imported vs exported fish and fish farming equipment to make fish ponds (rubles, pump, pond, battery, tell price, fish source).

  • Distribution points: caused by the variation in import/export costs of fish products as well as the additional costs incurred in winter season (company, aquarium, fish, eat, case winter, possible, come, costs).

  • Import restraints: linked to import restriction on special pipes, which are needed in construction of artificial small ponds for fish farming and their equipment (pipe, small ponds).

  • Purpose: concerns local industry protection by slowing down the import process (fish pond, day rest, need right).

  • Tariff Rates: linked to the currency fluctuation at various seasons on prices of fish pond and pump in winter as well as the main imported unit (rubles, fish pond, pump, winter, season, main source).

  • Time length: linked to the time length the amount of (tariff) lasts when importing fish pond, pump and other equipment (fish pond, pump, need, money, rest).

  • Customs and entry procedures: concerns the understanding of the working procedures in fish farming company and the time needed for import/export goods (work, company, fish, pond, understand, write, time).

  • Financial control: related to rate discrepancy for buying tolls for building fish bond faced by companies and fish breeding (company, fish, fishing, injector, pump, money).

  • Government participation in trade: concerns the different instruments supported by the government to help elevate this business area through different support mechanisms(organization, pump, pond, fish, pipe, hole, fence, batteries).

  • Product requirements: concerns company’s needed documentation as well as fish feed quality standards to ensure either selling their fish products or building their fish ponds (right, need, work, time, company, fish, pond, farm, fence, excavator, organic).

  • Quotas: caused by the limited quantity of fish pond in the winter each year, which reflect on fish production (fish pond, winter, year, quotation).

The overall summary of all the causes in terms of WordCloud representation is shown in Fig. 7 (a) and (b) for Finnish and Russian corpus dataset, respectively.

6 Trend comparison with economics statistics

In this section, we want to ascertain how realistic our findings are, compared with the Finnish-Russian official trade barriers statistical reports. Therefore, We used the report from the Finnish trade ministry (Nieminen et al., 2014) because of its detailed classification and ease link with OS-K ontology. It is worth pointing out the following. First, there is a critical time gap between the official report (2013) and the current study, which only marginally includes some historical data dating back to the reporting period, as forum data are overwhelmingly dominated by recent posts up to 2-3 years. The purpose of this analysis is to comprehend the trend in trade-barrier occurrence and disappearance rather than to perform a one-to-one correspondence between official statistics and our forum-based analysis. In addition, tackling identified trade barriers by a given country may take several years. It acknowledged that several identified trade barriers by WTO, especially when dealing with access to the Russian market, were well documented by WTO decades ago but are still subject to Russian-west intense economic tensions. This provides ground for this comparative and trend analysis. Finally, one shall note that in (Nieminen et al., 2014) report that the food industry (including the fish sector) represents only 5.9% of the total barriers in Russia. Therefore extra caution is needed when analyzing such results. Figure 8 illustrates the Finnish-Russian various cross-border trade barriers as reported (Nieminen et al., 2014).

Fig. 8
figure 8

The official reported trade barriers between Russia and Finland in 2013

In this plot, a slightly different employed categorization compared to OS-K ontology. Therefore, to ease this comparison, a mapping between this categorization and OS-K. Table 5 shows each report’s categorization class and its potential trade-barrier types in OS-K. Because of the insufficient reason in statistics and the absence of any further prior knowledge, the proportion of the report’s categorization class is equally split among all its indicated counterparts in the OS-K ontology. After such adjustment step, it becomes possible to represent the report’s findings in OS-K ontology and derive a direct comparison with forum-based analysis results.

Table 5 Official vs OS-K trade barriers

More specifically, let Cj be the proportion of the j.th in the statistical report. Let ı(j) be the set containing OS-K categories that can be equally be mapped to the j.th report categorization. Then the OS-K correspondence is constructed according to expression Eq. 5, where the sum is carried out over all proportions of report’s categories that contain the underlined OS-K class in their mapping as per Table 5Footnote 3:

$$ OSK_{i} = {{\sum}^{N}_{j}}\frac{C_{j}}{\|\imath(j)\|} $$
(5)

Next, to normalize each proportion score, we multiply it by 100 divided by the sum of all proportions. Figure 9 shows the normalized scores of trade-barriers occurrence and their comparison with forum-data analysis results.

Fig. 9
figure 9

The OS-K vs the official reported trade barriers between Russia and Finland

Figure 9 shows a clear discrepancy score of the barrier type proportion between the statistical report and the analysis outcome. Nevertheless, some overall trends can be highlighted.

  • From the Russian perspective, there is a rough indication that Product specification and Direction (discrepancy between import and export tariff) barriers achieve close scores between statistical report and forum analysis outcomes.

  • Similarly, from the Finnish perspective, Import restraints, Purpose and to some extent government participation in trade are the three main trade-barrier types that show better agreement between statistical report and forum-content analysis.

  • Finnish and Russian official statistics have almost equally reported several trade barriers (Direction, Product requirement, Purpose).

  • Looking at the highest statistical reported barriers, one recognizes an agreement between Russian and Finnish data where Customs and entry procedure class takes the most significant share. In the second position, Rates class is shared between Russian and Finnish outcomes as well.

  • Comparing the proportion of the Tariff and Non-Tariff barriers, we report a close distribution between Russian and Finnish forum data analysis outputs, where (55%, 45%) and (58%, 42%) are approximately assigned Tariff vs Non-Tariff scores in case of Finnish and Russian data, respectively. This surprisingly closely compares to official report finding as well where the distribution (54%, 48%), (57%, 43%), for Finnish and Russian cases, respectively.

Therefore, in overall, the comparison between forum-based analysis and statistical report findings reveals salient characteristics, confirming the timely endurance of trade-barrier, especially with political tension.

7 Implications

This research has twofold research implication: scientific implication and business implication.

Scientific implication

of this research concerns the contribution to the NLP-based research, which, we demonstrated that it is still is in its infancy phase to answer our research questions. In overall, i) a new ontology for fish business data collection, ii) a new deep-learning model that utilizes the state-of-the-art BERT architecture to match barrier-type definitions and negative polarity posts, and, iii) a new reverse-engineering based clustering method for causality identification have been devised and successfully implemented. Due to its intuitive ground, sound interpretation, theoretical foundation and open-source implementation, we expect the platform to easily accommodate other business sectors and find its way to a broad variety of other business applications. Business implications of this research lies in the critical importance of trade-barrier identification for new investors and small-to-medium scale enterprises where it is commonly acknowledged that the official economical statistical and insurance review report are lacking many unforseen and under-reported genuine barriers, which for various reasons can be overlooked by insurer and audit experts. Such a relevance is also rooted back to the importance of learning from peers in business sectors, where feedback from local SMEs is of paramount importance to either encourage or discourage other economical actors to pursue cross-border business adventures and internationalization. This provides regulators and national agencies with useful insights to reshape policy documents and engage in a more evidence-based trade negotiation. For instance, high and unfair competition and rigid regulations can lead to the bankrupt of the SMEs and economic long run failure (York, 2018). On the other hand, with the dominance of the mutual distrust and political escalation as in recent Russian-European trade relations, the reliance on official reporting for trade-barrier analysis is likely to be biased. Therefore, the contribution of any alternative evidence based collection will be of paramount importance for both economical operators and national agencies.

8 Conclusion

This paper introduces an efficient machine-learning-NLP approach for automatic trade type identification from business forums, contributing towards populating NLP technology in business intelligence mining. The architecture pipeline includes three key components. First, a three-leg ontology has been devised to collect relevant data from Finnish and Russian fish business forums. Second, a BERT architecture is used to match each trade-barrier type’s definition in the (OS-K) ontology, distinguishing eleven Tariff and Non-Tariff barrier types to negative polarity posts of the discussion forums. An automated thresholding similarity-based matching technique is then employed to quantify the extent of matching between each negative post with all trade-barrier types. Third, a TF-IDF vectorization and k-means clustering reverse-engineering based method is used to grasp the cause(s) behind each trade-barrier occurrence. Visualization using WordCloud and trend comparison with official statistics from Finnish Trade Ministry are employed to demonstrate the feasibility and technical soundness of the proposal, enabling us to identify latent characteristics of trade-barriers that goes beyond time frame constraint. The proposal paves the way for further investigations of other business-related intelligence that can be employed from the overwhelming increasing user-generated content from specialized web forums or blogs. The analysis of the findings suggests the dominance of the Time-length barrier type in the Finnish discussion forum dataset and import vs export tariff discrepancy and product requirement barrier types in the Russian forum dataset. Such a finding overlaps to a large extent with official statistics where Non-Tariff related barriers, especially product specifications, take the highest share among the reported barrier types. The study fills the gap of the urgent need to sustain mutual trust in cross-border collaborations. The findings can also be cast into a generic framework of the recommender-system umbrella. Indeed, the platform and results can assist companies or regulators in providing business-related recommendations to overcome the detected trade barriers regarding the target country trading policies for instance.