1 Introduction

E-commerce platforms have grown increasingly popular over the years. The COVID-19 pandemic further stimulated public interest in e-commerce, resulting in a boom of e-commerce businesses (Gao et al. 2020). As a result, the e-commerce industry has become more competitive, driving firms to invest considerably in improving their platforms.

E-commerce platforms organize their products into categories to make it easier for users to find what they intend to buy. Proposing the right category for a product based on its description can be difficult and time-consuming for e-businesses. Multi-level categorization further increases the complexity of this task. For example, “mega size original chips” can be classified under the snacks category, the salty snacks subcategory, and the chips segment. As the number of items offered on an e-commerce platform grows, keeping track of a plethora of categories at different levels becomes increasingly challenging. In this regard, the goal of product category classification models is to automate the process of determining the appropriate category for a given product using the available information, such as the product title and description. These models can serve various purposes in practice.

Instead of painstakingly reviewing all categories to identify the appropriate one for each item, retailers may utilize product category classification algorithms to automate item classification. This can be particularly helpful when categorizing new products introduced to the retail store. These models can also be used to revise biased or duplicated category levels currently existing in the system. Additionally, product classification models can help enhance marketing strategies and analytics, e.g., by efficiently processing product information collected from the marketplace.

The product title classification problem appears, at first glance, to be a variant of the well-studied text classification problem. Although there are some commonalities, the text length distribution and the structure of short titles distinguish product category classification from other text classification tasks (Yu et al. 2012).

Accordingly, different techniques have been developed for this problem to extract the most information from available brief textual information. Employing word-level and character-level features to obtain fine-grained subword information (Wang et al. 2017), context-relevant concept word embeddings (Xu et al. 2020), data augmentation (Rosario 2017), and word-cluster embeddings (Shen et al. 2018) are examples of such techniques.

An online store can benefit from product classification in three ways. First of all, automatic classification of items can aid customers in navigating online retail platforms. A high-quality product classification leads to a more effective and satisfying user experience. Secondly, it enables online merchants to better manage their sales and marketing activities. They can quickly introduce new products to their system and monitor aggregated data of different product categories instantaneously. Lastly, online merchants may identify and follow the products offered by other online stores. They can monitor collected data on the status of various product categories of their competitors using the recommendations of product title classification algorithms.

The problem of multi-level product category classification based on textual product information can be formally defined as follows. Let \(\mathbb {X}\) be the document space comprising textual product information, and \(\mathbb {C}^{\ell } = \{c_{1},c_{2},\ldots , c_{K}\}\) be the fixed set of K classes for category level \(\ell \in \mathbb {L}= \{1,2,\ldots ,L\}\). A text classification model \(\gamma\) maps the documents (i.e., products identified by textual data) to classes (i.e., product category values), that is, \(\gamma : \mathbb {X} \rightarrow \mathbb {C}^{\ell }\) for any given \(\ell \in \mathbb {L}\). As such, our classification problem can be described as determining \(\gamma\) such that \(\gamma (\mathbf {x}) \in \mathbb {C}^{\ell }\) is the product category for level \(\ell\) that is most appropriate for \(\mathbf {x}\in \mathbb {X}\). Note that this problem can also be posed as a multi-label classification task where \(\gamma\) classifies all L category values simultaneously. Table 1 shows sample data instances for a product classification task, which contain product titles and their related category, subcategory, segment, and subsegment tags. For instance, an accurate classification model would classify the product with title “Lipton Yellow Label Bulk Tea” as Beverages (Category), Tea & Hot Drinks (Subcategory), Bulk Tea (Segment), and Black Tea (Subsegment).

Table 1 Sample product titles and four-level category labeling (brand names are italicized)
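Concretely, each labeled instance pairs a short title with up to four category-level tags. A minimal sketch of such a data structure is given below; the field names are our own illustrative choices, and the sample values follow the Table 1 example.

```python
# A minimal, illustrative representation of a labeled product instance;
# field names are assumptions, and the sample values follow Table 1.
from dataclasses import dataclass

@dataclass
class Product:
    title: str        # textual input x from the document space X
    category: str     # level-1 label
    subcategory: str  # level-2 label
    segment: str      # level-3 label
    subsegment: str   # level-4 label

sample = Product(
    title="Lipton Yellow Label Bulk Tea",
    category="Beverages",
    subcategory="Tea & Hot Drinks",
    segment="Bulk Tea",
    subsegment="Black Tea",
)
```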

1.1 Research goals

We investigate the problem of multilevel categorization of grocery products based on textual product information, mainly consisting of product titles.

We explore the limits of text classification methods for grocery product title classification tasks through a detailed numerical study. We examine a variety of machine learning approaches based on Natural Language Processing (NLP) techniques. The methods investigated in this study may support better classification of new items, consolidated monitoring of other vendors’ products, and the diagnosis of improperly labeled products in the system.

1.2 Contributions

We summarize the contributions of our study as follows.

  • To the best of our knowledge, our work constitutes the first study on multi-level classification for predicting grocery product categories.

  • We compare and contrast six different text classification algorithms on this task. We examine different techniques, such as utilizing product titles alone and, for the transformer models, applying three dynamic masking approaches that filter out infeasible sub-level categories.

    Different from previous works, we leverage bi-lingual models to improve prediction performance based on Turkish and English product titles. By investigating both conventional and state-of-the-art NLP techniques, our analysis offers a robust baseline for the product title classification task.

  • We conduct a detailed numerical study using six distinct datasets acquired from different online retail stores to test the generalizability of the prediction models. Furthermore, we identify the challenges involved in grocery product title classification through a detailed analysis of the model predictions. This thorough analysis showcases the capabilities and limitations of various NLP techniques for product title classification, and contributes to further methodological and empirical work in this field.

1.3 Structure of the paper

We review the literature on hierarchical product category classification in Section 2, followed by the methodology and dataset discussion in Section 3. We report our findings for within- and cross-platform product category prediction in Section 4. This section also provides a discussion on incorrectly classified instances and underlying causes for such misclassifications. Lastly, we summarize our work, acknowledge and address the study limitations and discuss future research directions in Section 5.

2 Literature review

Hierarchical product category classification is a challenging task, as it requires product instances to be accurately assigned to multiple levels of categories based on limited information such as short product titles and brand names. In recent years, interest in this problem has grown with the rise of online shopping and the availability of large datasets.

Yu et al. (2012) provided one of the first studies on this research problem. They conducted an extensive numerical study to illustrate how linear SVMs can be used for large-scale multi-class title classification, and identified the differences between product title classification and generic text classification problems. Using a dataset from a large internet company containing 29 classes, they also compared the effectiveness of different feature representations. Their numerical study showed that stemming and stop word removal are harmful, whereas bigrams are effective for the product title classification task.

There have been significant advances in NLP models over the past decade. For word representations, methods such as GloVe (Pennington et al. 2014) and Word2Vec (Mikolov et al. 2013) became increasingly popular. More recently, advanced NLP techniques such as BERT (Devlin et al. 2018), RoBERTa (Liu et al. 2019), and XLM (Ma et al. 2020) have been shown to achieve state-of-the-art performance on many language tasks. These models, also known as pretrained language models (PTMs), are used somewhat differently from previous machine learning-based NLP models. They are first trained on large-scale unlabeled corpora to acquire a broad understanding of natural language. Then, depending on the task, a few layers are attached to the end of the “pretrained” base model. Afterwards, the full network is fine-tuned end-to-end on a smaller task-specific corpus. There are multiple advantages of using PTMs over traditional methods. Firstly, PTMs can generate contextual embeddings that are more representative of the input sentence; these embeddings capture not only the semantic meaning but also the contextualized meaning of the inputs. Secondly, since PTMs have a large number of parameters (exceeding 100 million), they can be trained on very large datasets without overfitting. Thirdly, the same PTM can be adapted to many NLP tasks with computationally inexpensive task-specific training. For instance, after training a PTM for a text classification task in a biomedical domain, the same model can be used for a question answering task in the same domain by only changing the last layers of the model and retraining on a small task-specific dataset. Finally, PTMs require less hyperparameter tuning, as they are less sensitive to changes in hyperparameter values and random initial weights than most deep learning architectures. Note that training PTMs from scratch is generally computationally expensive due to their large number of parameters. However, many PTMs are open-sourced and publicly available. Accordingly, it is possible to load the weights of a PTM and fine-tune the model on a smaller task-specific dataset to lessen the computational burden (Devlin et al. 2018).
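To make this workflow concrete, the following is a minimal fine-tuning sketch using the Hugging Face Transformers library; the checkpoint name matches one used in Section 3.5.2, while the label count and class index are illustrative assumptions.

```python
# A minimal PTM fine-tuning sketch (an assumed setup, not the paper's exact code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-uncased",
    num_labels=32,  # e.g., the number of category values (illustrative)
)

# Tokenize a batch of product titles and take one gradient step.
batch = tokenizer(["mega size original chips"], return_tensors="pt",
                  padding=True, truncation=True)
labels = torch.tensor([3])  # hypothetical class index
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # fine-tune the full network end-to-end
```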

The adoption of pretrained language models can also be seen in the most recent work on product category classification (Tagliabue et al. 2021; Zhang et al. 2020). Most of the recent literature on this problem can be found in the “Semantic Web Challenge” competition and the case studies published by the competing teams (Zhang et al. 2020). The second part of the challenge focuses on multi-level product category classification. The competition dataset contains more than 15,000 product instances randomly sampled from 702 vendors’ websites, with products labeled according to the GPC hierarchy. As baseline models, teams tested the same configuration proposed by Tagliabue et al. (2021), which uses the FastText algorithm. For evaluating the results, standard metrics such as precision, recall, and F1-score are used, and to measure the overall performance, Weighted-Average macro-F1 (WAF1) scores are reported by each participating team. All the top submissions used variants of the BERT architecture (Devlin et al. 2018). For instance, Zahera and Sherif (2020) (Team DICE) proposed a multi-label BERT architecture called ProBERT for multi-label product category classification. ProBERT contains fully-connected neural layers with Sigmoid activations for each classification task. The winner of the competition, Yang et al. (2020) (Team Rhinobird), proposed a slightly more complex method that uses BERT as the base model. To obtain a semantically rich representation, they used hidden states from the last hidden layers of BERT, resulting in 17 different BERT models. These models were then combined using a two-level ensemble strategy. In the first level, they applied 5-fold cross-validation (CV): they trained the same model five times, each time using a different fold as the validation set and the remaining folds as the training set, and then averaged the probability outputs of these five models, which share the same architecture but are trained on different folds. In the second level, an ensemble of 17 different BERT models was created, where each model votes for the prediction, and the most voted class is selected as the final prediction. Moreover, they proposed a Dynamic Masked Softmax function that explicitly considers the dependencies among different category levels (Zhang et al. 2020). Dynamically masking the subcategories based on the predicted category reduces the complexity of the optimization problem by filtering out the child categories unrelated to the predicted parent category.

3 Methodology

We provide details on the datasets and the methods employed for our multilevel product categorization task in this section. We also briefly discuss the experimental setup, evaluation metrics, and hyperparameter selection.

3.1 Datasets

Our datasets are mined from Turkish online grocery marketplaces. We extracted product information from seven online grocery retailers, as listed in Table 2. As the labels were inconsistent, a team of researchers manually explored the cases where the categories could be unified. Note that, to perform cross-platform classification, we employ one platform as the training set (i.e., Baseline DB), while the others serve as test sets. This approach is mainly adopted to conform with practical use cases for the product category classification task. For instance, an online retailer can use its own dataset to train a classification model, which can then be used to classify the products of other retailers. This way, the company can gather information about the products sold in the marketplace by type/category. Moreover, using the dataset from a single platform as the training set helps assess the generalizability of the classification models. Most of the datasets have two category levels (category and subcategory); however, we also consider a four-level categorization of the products, which additionally includes segment and subsegment information. We limit products in the test sets to those whose categories and subcategories are present in the training set. Table 2 lists the number of items, categories, subcategories, segments, and subsegments before and after the data cleaning process. Only one dataset (4-level DB) has all four labels, and we use it for a more in-depth analysis of the product category classification models.

Table 2 Dataset descriptions including number of items as well as number of category (Cat), subcategory (Subcat), segment (Seg), and subsegment (Subseg) values

The distribution of product title length, employed as the independent attribute in our classification task, is depicted in Fig. 1a. Titles are usually short, with an average length of 6.6 words, and follow a similar distribution across different datasets.

Fig. 1 Comparison of product title lengths of the datasets

We note that this short textual information may make the learning process more challenging. We also investigate the similarity of the product title length distributions across the datasets. Specifically, we employ the Kruskal-Wallis H test, a rank-based nonparametric test with the null hypothesis that the medians of all groups are equal. At the significance level of 0.05, with a p-value approximating zero, we reject the null hypothesis. We further pinpoint which specific groups significantly differ from the others using Dunn’s post hoc test. Figure 1b shows the p-values of the pairwise comparisons of product title lengths for the different sets; the white grids indicate pairs of datasets whose distributions differ significantly at the 0.05 level. We note that, in many cases, the product title length distributions differ between datasets, making cross-platform category prediction more difficult.

3.2 Baseline text classification models

We employ standard text classification models, namely, XGBoost, Support Vector Machine (SVM), and Long Short-Term Memory networks (LSTMs). XGBoost is a scalable tree boosting approach, which generates a collection of weak trees by incorporating the instances that contribute most to the model’s learning process. On the other hand, SVM, a popular approach in text categorization (Goudjil et al. 2018; Wang and Qu 2017; WibowoHaryanto et al. 2018), performs the classification by identifying the support vectors (i.e., the data points closest to the separating hyperplane) that separate the data instances. These points are chosen such that the margin of the classifier is maximized. These two models cannot handle textual information directly and require a numeric representation of it. Accordingly, we employ two approaches for transforming product titles into numeric vectors. Firstly, we utilize word embeddings to convert each word to a numeric vector. Specifically, we use the TF-IDF value of each word as its weight, and apply a weighted average over the whole title using the TF-IDF values and the word embeddings (Stein et al. 2019). Secondly, we use Sentence-BERT to convert each title directly to its numeric representation (Reimers and Gurevych 2019). BERT has the advantage of considering the semantics of the words surrounding the targeted word.
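As an illustration of the first approach, the sketch below averages word embeddings weighted by their TF-IDF values; the random `embeddings` dictionary is a stand-in for real pretrained vectors (e.g., GloVe), and the titles are illustrative.

```python
# A sketch of TF-IDF-weighted embedding averaging for product titles.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["lipton yellow label bulk tea", "mega size original chips"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(titles)
vocab = vectorizer.get_feature_names_out()

dim = 300  # typical embedding dimensionality
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=dim) for w in vocab}  # stand-in for real vectors

def title_vector(row):
    """Average word embeddings weighted by each word's TF-IDF value."""
    weights = tfidf[row].toarray().ravel()            # aligned with `vocab`
    vecs = np.stack([embeddings[w] for w in vocab])
    return weights @ vecs / max(weights.sum(), 1e-9)

X = np.stack([title_vector(i) for i in range(len(titles))])  # features for SVM/XGBoost
```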

LSTMs are a specific recurrent neural network architecture. They can model both long- and short-term dependencies in textual information by utilizing the input, output, and forget gates in LSTM cells. Because of their capacity to retain new or relevant information when needed and forget old or irrelevant information, LSTMs are suitable models for text classification problems. We employ a Bi-directional LSTM (BiLSTM) model in our analysis, which learns from textual input in both directions and then integrates the result using convolutional neural network layers to create a single representation (Li et al. 2018). We train separate networks for the four prediction tasks; that is, we feed the product titles to these four networks, and the corresponding network returns the associated prediction for category, subcategory, segment, or subsegment.
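A hedged Keras sketch of one such BiLSTM-based network for a single category level is given below; the layer sizes and vocabulary parameters are illustrative assumptions rather than the exact configuration of Li et al. (2018).

```python
# An illustrative BiLSTM-CNN title classifier for one category level;
# one such network is trained per level (category, subcategory, ...).
import tensorflow as tf

vocab_size, n_classes = 20000, 32  # assumed values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),  # integrate both directions
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```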

3.3 Pretrained language models

We fine-tune the large pretrained language models after attaching fully-connected layers to the output layer for the category, subcategory, segment, and subsegment labels. Before the fully-connected layers, the hidden layers of the pretrained language model are concatenated to obtain a detailed contextual representation of the model input (e.g., the product title). Softmax activations are used after the fully-connected layers to obtain the class probabilities. We use the Dynamic Masked Softmax (Yang et al. 2020) for the sub-level labels to enhance the classification performance. Three large pretrained language models with multi-lingual embeddings are considered in our study, namely, BERT, XLM, and XLM-RoBERTa.

3.3.1 BERT

BERT is a well-known pretrained language model. It obtains a detailed contextual representation by jointly conditioning on both left and right context in all layers. After unsupervised pretraining on a large dataset, the BERT model can be fine-tuned by adding an output layer to generate state-of-the-art NLP models for a variety of tasks. For our experiments, we use two versions of this architecture. The first is the multi-lingual version of BERT, pretrained on the 102 languages with the largest Wikipedias using a masked language modelling (MLM) objective (Devlin et al. 2018). The second is the Turkish language version (Oluk 2020), pretrained on 200 GB of Turkish text.

3.3.2 XLM

Conneau and Lample (2019) use a novel training objective that allows unsupervised training for learning high-quality cross-lingual representations. Specifically, XLM uses the Translation Language Modelling (TLM) objective in addition to BERT’s MLM objective. In the TLM objective, the same sentence in two different languages is concatenated, and random masking is applied to the concatenated sentence during training. To predict a masked token, the model uses the surrounding context in the same language as well as the context given by the second language. Accordingly, the XLM model can learn superior cross-lingual word embeddings by using both the MLM and the TLM objectives. We employ the multi-lingual version of this architecture in our analysis, which has been pretrained on the 102 languages with the largest Wikipedias (Devlin et al. 2018).

3.3.3 XLM-RoBERTa

Liu et al. (2019) propose the RoBERTa architecture, which involves a number of modifications to the original BERT architecture that improve the prediction performance on a variety of NLP benchmarks. The main changes include altering certain initial hyperparameters, removing the next-sentence prediction objective, and training with larger batch sizes. We employ the multi-lingual version of this architecture, pretrained on 2.5 TB of CommonCrawl data in 100 languages using a masked language modelling (MLM) objective.

3.4 Dynamic masking

In the standard classification approach for product categorization using pretrained language models, the last hidden states of the base model are concatenated to obtain an encoded representation of the inputs. Then, feed-forward network layers with Softmax activations are applied to compute the probabilities for the categories of each level. However, the standard Softmax layer does not take into account the dependencies between the parent (e.g., category) and child (e.g., subcategory) categories.

An alternative method proposed by Yang et al. (2020) uses Dynamic Masked Softmax to dynamically filter out the child categories that are unrelated to the current parent category (see Fig. 2). For instance, if the model predicts category \(c_{i}\), it should only recommend subcategories that fall under that category; infeasible subcategories should be filtered out using a binary mask. In this method, we define a binary mask matrix \(M \in \{0,1\}^{C \times S}\), where C is the number of categories and S is the number of subcategories. We then compute the Dynamic Masked Softmax instead of the regular Softmax for the subcategory predictions as

$$ P(y_{s}|c,\theta) = \frac{\exp(O_{s}) M_{c,s} + \exp(-8)}{\sum_{s'=1}^{S} \exp(O_{s'}) M_{c,s'} + \exp(-8)} $$
(1)

where c and s correspond to the category and subcategory labels, respectively, \(\theta\) denotes the model parameters, and \(y_{s}\) denotes the predicted probability for subcategory s. Note that the \(\exp(-8)\) terms are included for numerical stability. This design can also be extended to more than two levels as needed.
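The following NumPy sketch illustrates Eq. (1); building the mask from the observed (category, subcategory) pairs is our assumption of how M would be constructed, and the smoothing term is applied per element here so that the probabilities sum to one.

```python
# An illustrative implementation of the Dynamic Masked Softmax in Eq. (1).
import numpy as np

def build_mask(pairs, n_cat, n_sub):
    """M[c, s] = 1 iff subcategory s occurs under category c in the data."""
    M = np.zeros((n_cat, n_sub))
    for c, s in pairs:
        M[c, s] = 1.0
    return M

def dynamic_masked_softmax(logits, M, c):
    """P(y_s | c, theta): infeasible children of category c are masked out."""
    scores = np.exp(logits) * M[c] + np.exp(-8.0)  # exp(-8) for stability
    return scores / scores.sum()

M = build_mask([(0, 0), (0, 1), (1, 2)], n_cat=2, n_sub=3)
probs = dynamic_masked_softmax(np.array([2.0, 0.5, 1.0]), M, c=0)
# probs puts (almost) all mass on subcategories 0 and 1, the children of c=0.
```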

Fig. 2 Masked XLM, XLM-RoBERTa and BERT (Yang et al. 2020)

In the dynamic masking approach, the prediction of a sub-level category depends on the prediction of its parent level category. For instance, predicting the subcategory “Tea” depends on predicting the associated main category “Beverage” correctly. In our numerical analysis, we experiment with three different configurations of dynamic masking and compare these variants with the standard approaches. For all three configurations, we first extract the hidden layers that make up the transformer architecture, and concatenate them. Then, we apply feed-forward (Dense) layers and Softmax activation functions for each multi-class classification task.

In the vanilla configuration, the information about the parent level prediction is not carried to the child level prediction; these two prediction tasks are executed independently. Specifically, the input to the feed-forward network layers of each category level is the hidden state of the base model (\(H_{c}\)), as shown in configuration 1 in Fig. 2. As an alternative, it is possible to provide additional information about the parent level’s prediction to the child level’s prediction. For this purpose, we can concatenate the hidden state of the transformer model with the parent level’s output before predicting the child level. This information can be provided in two ways. In the first configuration, the output of the parent level’s feed-forward layer (\(\text{Dense}_{p}\)) before the activation is provided to the child layer (\(\text{Dense}_{s}\)). In the second configuration, the output after the softmax activation (\(\text{Softmax}_{p}\)) is provided. A sketch of these head configurations is given below.
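A minimal Keras sketch of the three configurations, assuming a 768-dimensional concatenated hidden state and illustrative label counts:

```python
# Illustrative classification heads over the concatenated hidden state H.
import tensorflow as tf

H = tf.keras.Input(shape=(768,), name="concat_hidden_state")  # assumed size
n_cat, n_sub = 32, 180  # illustrative label counts

dense_p = tf.keras.layers.Dense(n_cat)(H)       # parent logits (pre-activation)
softmax_p = tf.keras.layers.Softmax()(dense_p)  # parent probabilities

child_in_vanilla = H                                           # vanilla: H only
child_in_cfg1 = tf.keras.layers.Concatenate()([H, dense_p])    # H + pre-activation output
child_in_cfg2 = tf.keras.layers.Concatenate()([H, softmax_p])  # H + softmax output

dense_s = tf.keras.layers.Dense(n_sub)(child_in_cfg1)  # pick one configuration
softmax_s = tf.keras.layers.Softmax()(dense_s)
model = tf.keras.Model(inputs=H, outputs=[softmax_p, softmax_s])
```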

3.5 Experimental setup

Figure 3 depicts our experimental setup, which consists of two parts: within- and cross-platform grocery product category prediction. We first investigate the performance of the models in predicting the product categories (Part I). Accordingly, the Baseline DB described in Section 3.1 is used as the training set, and it is subjected to 5-fold CV. Then, we conduct experiments to explore the performance of different models and word embeddings. Furthermore, for a more comprehensive evaluation of the dynamic subcategory masking for the pretrained language models, we train these models again on a larger dataset.

Fig. 3 Experimental setup

For this part of the analysis, we use the 4-level DB dataset, which contains more than 24,000 products categorized into four levels. Secondly, we use the best-performing models from Part I to determine their generalizability, as indicated by their performance in predicting the products of other online platforms. The cross-platform prediction analysis also helps in understanding the capabilities of the models in predicting new potential products in the system.

3.5.1 Evaluation metrics

We utilize the Brier score, accuracy, and the weighted-average macro F1-score (WAF1) to evaluate the classification performance. The Brier score is a common accuracy measure for probabilistic predictions. It is calculated for each level by measuring the mean squared distance between the predictions and the binary truth vectors, with larger distances reflecting higher prediction error; thus, a lower score indicates better performance. The Brier score is calculated as

$$ \text{Brier-score} = \frac{1}{N} \sum\limits_{i=1}^{N} \sum\limits_{k=1}^{K} (f_{ik} - o_{ik})^{2} $$

where N is the number of samples and K is the number of classes. For a given instance \(i \in \{1,\ldots,N\}\), \(f_{ik}\) represents the predicted probability for class k, and the \(o_{ik}\) values encode the ground truth vector (the k th value being one and the others zero). On the other hand, accuracy is a simple statistic that indicates how often a model makes a correct prediction. Lastly, the F1-score is calculated as the harmonic mean of precision and recall, that is, \(\text{F1-score}=2 \times \frac{\text{precision} \times \text{recall}}{\text{precision}+\text{recall}}\). These values are determined separately for each class, and the WAF1-scores are derived using a weighted average. Similar to Zhang et al. (2020), we rank the models by taking the average WAF1 over all category levels.
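For concreteness, a small sketch of these metrics for a single category level is given below; interpreting WAF1 as the class-support-weighted F1 average (scikit-learn's "weighted" mode) is our reading of the metric.

```python
# Illustrative computation of the three metrics for one category level.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def brier_score(probs, y_true):
    """Mean squared distance between predicted probabilities and one-hot truth."""
    onehot = np.eye(probs.shape[1])[y_true]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

probs = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])  # toy predictions
y_true = np.array([0, 2])
y_pred = probs.argmax(axis=1)

print(brier_score(probs, y_true))                    # lower is better
print(accuracy_score(y_true, y_pred))                # fraction correct
print(f1_score(y_true, y_pred, average="weighted"))  # WAF1
```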

3.5.2 Parameter settings

The TensorFlow and Transformers libraries are used to implement the BiLSTM and pretrained language models (Wolf et al. 2020). We fine-tune the pretrained transformers bert-base-turkish-uncased, bert-base-multilingual-uncased, xlm-mlm-100-1280, and jplu/tf-xlm-roberta-base for the classification tasks. We utilize the Adam optimizer and a batch size of 16 for all pretrained language models. Early stopping is used during the training phase to minimize overfitting. The weights of the best-performing model on the validation set are saved and used for testing.

We also perform a grid search to tune the hyperparameters. As mentioned in Sections 4.1 and 4.3, this procedure is carried out on a distinct validation set. Table 3 shows the parameter search space for each model, with the best combinations of hyperparameters shown in bold. We observe that SVM and XGBoost are highly sensitive to their parameters, whereas the LSTM and transformer models are less affected by changes in their network-related parameters. We use the scikit-learn and xgboost packages in Python to implement the SVM and XGBoost models, respectively. A sketch of this validation-based grid search is given after Table 3.

Table 3 Model parameters and search space
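Below is a hedged sketch of the validation-based grid search for the SVM baseline; the toy data and the two-parameter grid are illustrative, while the actual search spaces are those listed in Table 3.

```python
# An illustrative grid search over a held-out validation set (not CV folds).
from itertools import product
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 16)), rng.integers(0, 3, 200)  # toy data
X_val, y_val = rng.normal(size=(50, 16)), rng.integers(0, 3, 50)

best_model, best_f1 = None, -1.0
for C, kernel in product([0.1, 1, 10], ["linear", "rbf"]):  # illustrative grid
    clf = SVC(C=C, kernel=kernel).fit(X_train, y_train)
    f1 = f1_score(y_val, clf.predict(X_val), average="weighted")
    if f1 > best_f1:  # keep the combination that maximizes validation WAF1
        best_model, best_f1 = clf, f1
```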

4 Numerical results

In our numerical study, we first focus on identifying the best-performing models on the baseline dataset. Next, we assess the impact of the dynamic masking approaches on the multilevel product categorization task. We then examine the generalizability of the model performances to different datasets extracted from various online grocery retailers; in this experiment, classification models are trained with the two-level baseline dataset (i.e., with category and subcategory labels) and tested on six two-level test sets. Lastly, we demonstrate misclassifications for sample data instances, and discuss the underlying causes and possible remedies to improve classification performance.

4.1 Performance comparison

We examine alternative word embeddings and a wide range of multi-class classification methods, including classic machine learning algorithms (e.g., SVM and XGBoost) as well as more contemporary deep learning methods (e.g., BiLSTM and BERT), for the multilevel product categorization task. Using a 90-10 division of the training set, we create the validation set and optimize the model parameters accordingly. Then, using stratified CV, we split the dataset into five folds, where one fold is used as the test set and the remaining folds as the training set; this process is repeated five times so that each fold serves as the test set once. This approach enables identifying appropriate parameters while minimizing overfitting.
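A sketch of this protocol is given below; `titles` and `labels` stand for aligned arrays of product titles (or their embeddings) and category labels, filled with toy values here.

```python
# An illustrative split protocol: 90-10 validation split, then 5-fold CV.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

titles = np.array(["bulk tea", "original chips", "fennel tea", "milk dessert"] * 25)
labels = np.array([0, 1, 0, 2] * 25)  # toy category labels

X_tr, X_val, y_tr, y_val = train_test_split(
    titles, labels, test_size=0.1, stratify=labels, random_state=42)
# ... tune model parameters on (X_val, y_val) ...

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(titles, labels):
    pass  # train on titles[train_idx], evaluate on titles[test_idx]
```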

Table 4 illustrates the performance of each model and word embedding setting, as obtained by 5-fold CV.

Table 4 Comparison of different models and word embeddings using Baseline DB (performance values reported as avg. ± stdev, which are obtained using 5-fold CV results)

We find that the Turkish GloVe embedding provides a superior numeric representation for the vocabulary in our dataset when compared to the Turkish FastText, Sentence-BERT, and Turkish Word2Vec embeddings. BiLSTM with the Turkish GloVe embedding leads to the best overall performance among the standard text classification methods, as indicated by the Brier score (avg. Cat: 0.06, avg. Sub: 0.14), accuracy (avg. Cat: 96.6%, avg. Sub: 90.6%), and F1-score (avg. Cat: 96.5%, avg. Sub: 90.1%). The pretrained language models mostly perform similarly, with the BERT-based models performing slightly better than the others, followed by the XLM and XLM-RoBERTa models, as indicated by the average F1-scores. Surprisingly, the vanilla pretrained language models outperformed the masked configurations for the two-level product classification task.

Overall, these results show that the vanilla BERT architecture with multilingual embedding performed the best among all considered classification model-word embedding pairings, closely followed by other transformer architectures.

4.2 Impact of dynamic masking

We investigate the effects of dynamic masking on multi-level product category classification in more detail by considering the four-level classification task.

Table 5 shows the results for the pretrained language models with three masking strategies obtained by using 4-level DB.

Table 5 Comparison of dynamic masking approaches using 4-level DB (performance values reported as avg. ± stdev, which are obtained using 5-fold CV results)

We find that BERT with the Turkish embedding performed the best, with a marginal improvement over BERT with the multilingual embedding. We also see that the ranking of the masking configurations is consistent across all transformer architectures in terms of the average F1-score: the non-masked configuration performed best, followed by the “Mask”, “Mask After Activation”, and “Mask Before Activation” configurations.

The results in Tables 4 and 5 indicate that the masking mechanisms do not necessarily improve the classification performance of the model. This result can be attributed to the presence of semantically similar child categories under different parent categories. With the masked models, if the parent category is incorrectly predicted, the child categories will be misclassified as well, because all the subcategories under the other categories are masked out. The non-masked models, in contrast, can still predict the child category correctly even when the parent category is predicted incorrectly. We present mispredictions from the validation dataset using the non-masked BERT model (BERT-TR) in Table 6 to better illustrate this phenomenon. In these examples, we observe that the child category prediction can be correct even though the parent category prediction is incorrect. For example, in the first instance, the subcategory of a liquid soap product is misclassified as “Cosmetics” instead of “Soap”, while the remaining category levels are predicted correctly.

Table 6 Examples of the non-masked model mispredictions obtained by BERT-TR (texts are translated from Turkish to English, wrong predictions are italicized)

Table 5 provides further evidence on why the masked versions might not perform as well as the vanilla models. For instance, we observe that the average reduction in child-level category accuracy is higher when masked configurations are used, indicating that masking leads to additional performance deterioration when predicting sub-level labels. It is worth noting that there are some exceptions, such as the subsegment level of the BERT-ML model. We believe these exceptions can be explained by the randomness introduced by the mixed training objective.

Overall, we find that the use of dynamic masking has certain benefits and drawbacks. It reduces the complexity of the prediction task by masking the sub-levels that do not belong to the predicted parent category, which helps achieve higher prediction performance when the number of classes at that level is high. On the other hand, when the boundaries between the categories are not clear in the dataset, dynamic masking can have a detrimental effect on model performance. This mainly stems from the fact that, with dynamic masks, a wrong parent prediction means all the sub-level predictions will be incorrect; when the parent category predictions are inaccurate, the misclassifications can compound at each level, leading to lower sub-level prediction performance. In brief, dynamic masking is most useful when the classes are strictly separated, the higher-level categories are accurately predicted, and the number of categories at the sub-levels is high. Accordingly, the model design needs to take these factors into account before implementing a dynamic masking configuration.

4.3 Cross-platform multilevel classification

We next discuss our findings with the cross-platform classification task.

Note that certain categories and subcategories in the extracted datasets may not exist in the training set (i.e., Baseline DB). Accordingly, we only take into account the items whose categories and subcategories are available in the training set. We use the models with the best hyperparameters described in Section 3.5.2 to perform the predictions. We employ the models trained in each fold of the 5-fold CV process to predict the labels in the test sets, which helps achieve a robust evaluation. Therefore, we report the average and standard deviation of the performance values for each test set obtained by the five prediction models (see Table 7).

Table 7 Model performance in cross-platform category prediction for competitor analysis

Our analysis in Section 4.1 identified BERT-ML, BERT-TR, and XLM-ML as the best-performing models for within-platform product categorization. In cross-platform prediction, surprisingly, BiLSTM outperforms or performs on par with the BERT and XLM variants. The Brier score and accuracy metrics rank the models in the same order: for Test Set-2 and Test Set-4, BERT-TR is the best performer, while for the remaining four datasets BiLSTM provides the best performance values. The F1-score metric identifies BiLSTM as the best model for all six datasets. Overall, while the within-platform performances of the models given in Table 4 are relatively close, there is higher performance variation in cross-platform prediction.

Overall, the cross-platform product categorization outcomes show that models trained on a comprehensive training dataset may be applied to datasets collected from other online grocery stores. However, we note that the product titles in each test set are different, and, as the difference between the training set and a test set grows, the cross-platform performance deteriorates further.

4.4 Discussion on model predictions

Lastly, we visually investigate the predicted category levels and the ground truth to find the items for which the models fail to predict the proper category and subcategory values. Table 8 shows sample data instances, the ground truth values and the predictions from BERT-TR model. We summarize our general observations on the misclassifications as follows:

  • If a product exists in the test set but not in the training set, or its title is phrased differently from the training set, miscategorization may occur.

  • The model’s predictions are influenced by the general connotation of some brand names. For instance, “Raffaello” is a brand that produces chocolate; however, raffaello is also a type of pasta. Accordingly, for the last instance in Table 8, the model predicts the category and subcategory labels as “Pasta” and related labels, whereas the ground truth is related to “Chocolate”.

  • Product categorization is generally done manually and thus can be subjective. For example, one online store may classify a product as a dairy product, while another may categorize it as a beverage. Such problems cannot be fully resolved during the data cleaning process, since we encounter a long list of product titles and category levels, and manual inspection may itself introduce subjectivity. Hence, we did not revise such category/subcategory labels in the data cleaning phase.

  • Some product names might be associated with more than one category. A book on cooking, for example, may be classified as food rather than a book.

  • In a four-level product category classification task, the finest level can be too specific and difficult to predict. Therefore, more data instances might be needed to fit each category level well.

Table 8 Examples of product category mispredictions obtained by BERT-TR (texts are translated from Turkish to English, wrong predictions are italicized)

We observe that BERT-TR fails to predict the exact category or subcategory for some items, as shown in Table 8. Investigating failure cases can further reveal the reasons for misclassifications. “Dentiste Sensitive tube 100 gr” is a regular toothpaste imported to Turkey. The title does not carry much information other than “sensitive” and “tube”; the model predicted the item as a shaving cream, since shaving creams come in tube form and are usually advertised for sensitive skin. For the product “Green Life fennel tea 150 gr Bag”, the model predicted the category, subcategory, and segment correctly, but failed at the finest level, possibly due to the keyword “Bag” in the title. “Piyale strawberry pudding 125 gr” is categorized as a milk dessert by the model, while the actual label is pudding; this misclassification can be justified by the fact that pudding is prepared using milk, so a fresh pudding could be classified under either group. Such delicate categorization could also be the reason why “Begdat bay leaves” is misclassified at the subcategory level. Raffaello is a brand that produces snacks, and raffaello is also a type of pasta; therefore, with the short title “Raffaello 150 gr”, the model categorized the product as spaghetti rather than milky chocolate.

We note that a retail company can benefit from such detailed examinations by unifying similar subcategories, merging and revising misleading or subjective category levels, and expanding the product titles on its websites with more product information. This also benefits the search engine optimization of a company’s online grocery store. Accordingly, providing more information about a product’s ingredients than a simple title may enhance prediction accuracy; currently, many grocery websites in Turkey do not include such detailed information. Overall, the model performances are found to be reasonable considering the limited input provided to the models.

5 Conclusion

Companies rarely have complete knowledge of the items available in the marketplace while developing a marketing strategy. As a result, they typically predict the missing items and match them to their category definitions to better understand the market. Furthermore, they may seek to discover wrongly categorized items in their database to get a better sense of current market trends. In this research, we investigated text classification techniques to automate the prediction of product category levels using product titles. We evaluated datasets extracted from Turkey’s leading online grocery platforms and used several machine learning techniques for the prediction task. We employed three different masking strategies to improve the accuracy of pretrained language models for multi-level product classification. After conducting a detailed comparative analysis with these models, we closely examined the misclassified products. This analysis revealed many cases where the existing labels (i.e., category values) are debatable or difficult to predict even for human experts.

We observe that the tuned text classification methods show a high level of robustness for both within-platform and cross-platform prediction tasks, as evidenced by the low standard deviations of the various performance metrics. In addition, our preliminary analysis indicates that the deep learning-based classification methods are highly robust to model parameters as well. Low Brier scores and accuracy and F1-score values above 90% indicate that these classification methods can be used in practice to avoid tedious manual categorization, to automatically label new products, and to examine the large number of products sold by a retailer. They can also be used for enhanced marketing strategies and analytics, as they enable categorizing the products in the marketplace, allowing more efficient data collection from other retailers. We note that the use of pretrained versions of the large language models significantly reduces the computational requirements: training a pretrained model on a task-specific dataset takes a few hours on a GPU and between 24 and 48 hours on a regular CPU. As such, the product classification methods considered in this study can be employed by many retail companies.

We recognize that our research has certain limitations. While we ensure that we utilize both well-established and novel techniques in text categorization, NLP is a fast-changing field, and continually analyzing the trends and applying alternative approaches to the prediction task is necessary. Furthermore, because many online retailer websites do not provide additional specific information about the products, we only consider product titles for the category classification task. Future research can integrate product descriptions, specifications, and prices to enhance prediction performance.

Designing text classification models to improve performance in certain product categories might be an interesting future research direction. The trained models, for example, exhibit low accuracy in the “Newspaper & Magazine” subcategory. Fine-tuning the models on an online book store dataset or training a second book/non-book classifier may improve the predictions for this category without compromising performance in other areas. In addition, considering that different models perform better on different cross-platform prediction tasks, an ensemble of models can enhance the prediction performance. Moreover, Stochastic Weight Averaging (SWA) may be used to further improve performance for unbalanced product category cases. Finally, the semantic similarity of some categories might be problematic for both models and practitioners. Companies may use these models to revise their manual categorization and address potential bias or overlap in their product labeling; occasionally revisiting such information may help better organize the products on online platforms. Therefore, we aim to expand our empirical analysis as we collect more data from the marketplace.