1 Introduction

Over the past decades, new technologies and the development of online social platforms have made a large amount of text data available to researchers. Consequently, many studies have been conducted with the aim of exploiting the informative content of digital text. Currently, texts are used as data in a variety of applications yielding social and economic insights: authorship, sentiment, nowcasting, policy uncertainty, media slant, market definition and other topics, as witnessed by the review by Gentzkow et al. (2019). Further studies highlight the contribution of text data to different areas of human life, such as politics (Jentsch et al. 2020), public administration (Hollibaugh 2019), education (Ferreira-Mello et al. 2019) and several branches of the medical sciences (Luque et al. 2019).

With a focus on marketing and business, Reisenbichler and Reutterer (2018) have recently overviewed the wide range of theoretical and applied research based on text as data and have highlighted the major role played by topic modelling. The latter is a class of unsupervised learning methods developed in a probabilistic setting and capable of clustering text documents into a number of topics. The most widely applied topic model is probably the latent Dirichlet allocation (LDA), referring to the Bayesian model developed by Blei et al. (2003), which, in essence, represents each document as a probability distribution over topics and, in turn, each topic as a probability distribution over words. LDA is a model-based clustering method, related to finite mixture models. It is recognised as a flexible and versatile tool for analysing text data and, as such, it has since been extended into multiple variants.

However, when individual texts are very short, say in the range of one to thirty words, as is the case for data prevalent on websites, such as titles, image captions and questions on Q&A webpages, LDA might generate topics which are not meaningful. In particular, it is recognised that LDA does not perform well when applied to short text fragments, such as microblogging posts, tweets, headlines and product reviews. This is essentially due to sparsity, as in these cases LDA has too little word co-occurrence information. Several strategies have been proposed to alleviate the problem of data sparsity in short texts: combining short documents together, employing external resources, such as Wikipedia, to overcome the lack of information, or using alternative models better suited to short texts; see the discussion in Cheng et al. (2014), Jipeng et al. (2019), Tuan et al. (2020), Anderlucci and Viroli (2020) and the many references therein. The overall impression is that there is room to investigate whether alternative approaches are more suitable for telegraphic texts.

As an alternative to the topic modelling approach, we explore the class of supervised learning methods that may provide complementary knowledge to the widely acknowledged LDA for the analysis of short texts. More specifically, in line with the classification recently introduced by Gentzkow et al. (2019, section 3), the methods used in the paper belong to the class of text regressions.

In the context of text regression, the actual problem that applied researchers have to face is variable selection. It is known that text is inherently high-dimensional and sparse, even when telegraphic. Indeed, the set composed of the union of words, and possibly n-grams, is generally huge. As each token, be it a single word or an n-gram, is a feature or a potential predictor, statistical analysis requires methods for high-dimensional parameter spaces. Drawing on the literature where the number of predictors, P, is either larger than, or large relative to, the number of observations, N, analyses of text data have been based upon two main ways of restricting attention to lower-dimensional subspaces: shrinkage and dimension reduction (Friedman et al. 2008). Both approaches have been applied in text regression: the lasso penalised regression and its modified version in Nowak and Smith (2017) belong to the class of shrinkage methods; the singular value decomposition used by Foster et al. (2013) and the latent semantic indexing in Deerwester et al. (1990) belong to the class of dimension reduction methods. Another stream of the literature has addressed text regression by means of nonparametric methods, such as extensions of neural networks and recursive trees. The latter methods will not be considered in the paper, as it is often the case that they provide results mainly focused on prediction, which may be hard to interpret (see Minaee et al. 2020 and references therein). Rather, with a focus on interpretation, we investigate the potential of text regression when variable selection is performed by resorting to variable selection models.

The objective of the paper is to compare different variable selection procedures when regression models are fitted to real, short text data. The focus is on lasso-based methods, as the lasso (Tibshirani 1996), along with several of its variants, is among the most commonly applied variable selection methods, thanks to its flexibility and computational feasibility. LDA for topic modelling is also included as a term of comparison. The relative performance is compared in terms of complexity and quality. Complexity is related to the number of selected variables and is assessed against the predictive \(R^2\) index. Quality, on the other hand, is related to the selection of relevant variables, that is, variables that give value to and help explain the quantitative response variable. As objective measures of relative performance in terms of quality, we select the frequency of inclusion and the model class reliance. The former measures how frequently the same variable is selected over bootstrap replications and will be accompanied by measures of variability. The latter (Fisher et al. 2018) is a core measure of variable importance related to permutation-based importance measures, applied here in the context of text regression. All in all, these three indicators jointly describe the debated concept of interpretability. Indeed, interpretability in the context of machine learning methods has been defined by Doshi-Velez and Kim (2017), based on the Merriam-Webster dictionary, as the “ability to explain or to present in understandable terms to a human”. Our view is similar in spirit to the proposal expressed by Margot and Luta (2021), who claim that algorithms with high “predictivity, stability and simplicity” are interpretable in the sense of Doshi-Velez and Kim (2017). Our approach investigates similar aspects of the different methods and, as a main novelty, we also focus on the ability to detect relevant variables.

Our perspective is empirical: the analysis is carried out on two real datasets from two fields of application that are present in many digital contexts and whose solutions are of interest in a wide range of situations.

The first case study addresses the issue of explaining the variation of prices of goods within the same category, based on the descriptive text or label provided by producers on e-commerce platforms. Solving this task may provide new insights into hedonic evaluation and price index measurement. Moreover, if estimated prices were made available, they would decrease information asymmetries in the markets, by opening information to consumers.

The second case study is concerned with how open questions inserted in a questionnaire may be informative about overall satisfaction ratings. Datasets of this type are very common. According to the review in Reisenbichler and Reutterer (2018), they are analysed in a large portion of papers; see also Lange et al. (2022), where a bagging approach to unsupervised sentiment analysis is developed and the relation to unsupervised learning based on embedding methods and lexicons is discussed. This research focus is motivated by the growing interest in using text analytics, either from social media or from traditionally collected interviews, to gain insights about experience or satisfaction. Here, the main rationale is twofold. Firstly, open questions may elicit opinions and points of view that could not have been anticipated by conventional multiple-choice closed questions. Secondly, learning from the words contained in open questions to explain satisfaction ratings may open the way to new methods of survey design, capable of replacing traditional questionnaires with pre-defined attributes, which are often expensive and time-consuming to prepare and fill in.

The selection of applications necessarily omits many worthy areas of interest, and we do not have the ambition of exhaustively covering the wide range of possible applications. Although in this paper we do not introduce new methodologies to address the issues of short text modelling, we believe that empirical analyses may shed a great deal of light on the effectiveness of text data as explanatory variables in parametric regression models. As a matter of fact, the methods considered in the paper have often been compared and validated only over simulated data. Besides, the two case studies considered in the paper are inherently different in nature, as, in the first application, a concise descriptive text of attributes is considered, whereas, in the second case study, the short text expresses an opinion.

The paper is organised as follows: Sect. 2 overviews the literature on text regression models through shrinkage methods; Sect. 3 illustrates the datasets and the text pre-processing for each case study, while Sect. 4 describes the design of the study along with a few technical details. In Sect. 5, we analyse the results of each case study. Concluding remarks are provided in Sect. 6.

2 Modelling high-dimensional sparse text data

Variable selection in regression analysis is an age-old problem in statistics, which has recently encountered renewed interest due to the increasing availability of high-dimensional data. In sparse settings, the focus is on disentangling a few meaningful variables, playing a major role for interpretation purposes, from the redundant and noisy remaining ones. A large variety of methods have been developed for sparse high-dimensional regression, with the majority of applications dealing with research in genomics. Alternative methodologies may be grouped into three main classes, with several mixed proposals: penalty-based, screening-based and randomisation-based methods.

Penalty-based procedures encourage sparsity by imposing a penalisation on parameters at the estimation stage. Different penalties, on the norm or concave, and different shapes of the penalty give rise to alternative specifications: for instance, lasso (least absolute shrinkage and selection operator, Tibshirani 1996), ridge (Hoerl and Kennard 1970) and the elastic net (Zou and Hastie 2005) impose penalties on the norm with different shapes, while the non-negative garrote (Breiman 1995) and SCAD (Fan and Li 2001) impose different concave penalties. One of the most commonly applied methods is certainly the lasso, both for its computational feasibility and for its predictive performance. The lasso is not without limitations, as it may exhibit poor variable selection results (Bach 2008). Meinshausen and Bühlmann (2006) find that it tends to select noisy variables when the penalty parameter is chosen to optimise prediction, and suggest resorting to criteria other than cross-validation to identify causal predictors. A principled approach to model selection is provided by information criteria, which search for a balance between the maximised likelihood function and the model complexity by adding a penalty term related to the dimension of the parameter space. Traditional information criteria are the Akaike information criterion (AIC, Akaike 1974), which guarantees predictive performance, and the Schwarz information criterion (BIC, Schwarz 1978), derived under a Bayesian approach and proved to be consistent in a number of circumstances. Unfortunately, when the parameter space is large, BIC has been observed to be too liberal (Bogdan et al. 2004; Broman and Speed 2002). Thus, Berger (unpublished) suggests the generalised information criterion, which refines the choice of prior distribution, and Meinshausen and Bühlmann (2006) propose a data-adaptive tuning parameter procedure. Alternatively, the extended BIC (Chen and Chen 2008) adjusts the prior probabilities by adding a further tuning parameter which accounts for the cardinality of models as the number of covariates increases. Compared to the previously developed criteria, it retains considerable simplicity.

Screening procedures for variable selection rank predictors by relevance; subsequently, dimensionality is reduced by selecting the highly ranked predictors and running standard variable selection, such as the lasso or its variants, over the selected variables. Candes and Tao (2007) propose the Dantzig selector, which is the solution to an \(\ell _1\)-regularisation problem and achieves the ideal risk. A further method is the sure independence screening (SIS) by Fan and Lv (2008). The sure screening property states that all important variables survive the variable screening with probability tending to one, and it is obviously desirable that a selection method possess it. In this method, the variable screening relies on correlation learning: predictors are ranked according to their marginal correlation with the response variable, and those with low marginal correlation are filtered out. Predictors are considered one by one, independently. As a result, the method reduces the space of predictors. After that, the original problem of estimating the model may be solved with classical estimators.

Another way to tackle high dimensionality is to resort to methods inspired by the idea of randomisation or consensus combination. The rationale of randomisation is to execute the lasso, or other variable selection algorithms, over repeated samples of the original data, generated by bootstrap or resampling methods, and to average over the multiple results, so that the instability of running a selection algorithm only once can be overcome. Several randomisation algorithms have been proposed: Bolasso (Bach 2008) lets the lasso select variables over bootstrap replications and keeps the intersection of the variables selected across replications; stability selection (Meinshausen and Bühlmann 2010) chooses all the variables that occur in a large fraction of the resulting selected sets. Complementary pairs stability selection (Shah and Samworth 2013) is a variant of stability selection based on a modified bootstrap scheme, which yields improved error bounds and therefore favours the applicability of the methodology. These methods have good consistency properties in terms of variable selection. As a drawback, they are more computationally intensive.

The aim of this study is to identify which of these methods are more effective in selecting the relevant variables. To this purpose, we compare some specifications of the previously presented methods in terms of the number of selected variables, of a measure of fit, namely the predictive \(R^2\), and of the importance of the selected variables, in terms of frequency of inclusion and model class reliance. To the best of our knowledge, these alternative variable selection methods have never been compared over regression models based on text as data.

3 Data description and pre-processing

3.1 Case study 1

The first case study focuses on the task of pricing items available on producers’ e-commerce platforms. Our specific aim is to model regular fashion prices within a cross-sectional comparison, as we are not primarily interested in price dynamics. Indeed, fashion goods are seasonal products sold over a finite season. In these markets, retailers often use dynamic markdown policies, in which an initial retail price is set at the beginning of the season; the price is then marked down as the season progresses, in order to minimise the stock-out risk (Soysal and Krishnamurthi 2012). We use data scraped from the UK e-commerce websites of four fashion producer brands. The scraping operation collects the records available on the four websites during one week of a single season. The brands are chosen to be highly heterogeneous with respect to the average price of the items on sale: two of them are large fast fashion chains selling at very low prices, and the remaining two brands base their business on the design of enhanced fashion items. Fast fashion and enhanced design brands are comparatively discussed in Cachon and Swinney (2011), and we argue that our analysis can provide material for further insights on enhanced design versus fast fashion. Each record collects information on price, category, brand (which we have deliberately excluded from the specification) and a description field. The analyses are carried out for the knitwear and dresses categories. A sample from the dataset is reported in Table 1.

Table 1 Sample from the dataset of case study 1

3.1.1 Pre-processing

Before addressing the issue of sparse modelling, we perform some preliminary steps to reduce dimensionality and to map the raw text into a numerical matrix, the document term matrix (DTM), whose ijth element indicates the count of the jth word or token in the ith document. Some text preparation operations are required in order to process the data and reduce meaningless dimensionality. First, we remove non-word elements (such as numbers, punctuation and proper names); then, words in a standard English stop-words list are automatically removed, and further contractions, such as don’t or it’s, or misspellings are excluded manually; finally, words are replaced with their roots through stemming. Lastly, stems displaying sparsity higher than 99% are deleted.
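For concreteness, the following is a minimal sketch of this pipeline in R using the tm package; the paper names R as the computational environment but does not publish its pre-processing code, so the vector texts and the exact calls are illustrative assumptions.

```r
library(tm)   # also requires SnowballC for stemming

# 'texts' is a hypothetical character vector of raw item descriptions.
corpus <- VCorpus(VectorSource(texts))

# Remove non-word elements (numbers, punctuation); lower-case first.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)

# Automatic removal of standard English stop-words (contractions and
# misspellings are excluded manually in the paper and omitted here).
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Replace words with their roots through stemming.
corpus <- tm_map(corpus, stemDocument)

# Build the document term matrix and delete stems with sparsity > 99%.
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)
X   <- as.matrix(dtm)   # N x P matrix of token counts
```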

After the pre-processing step, the DTM for the knitwear dataset has \(N=382\) rows and \(P=229\) columns (words), while the DTM for the dresses dataset has \(N=1110\) rows and \(P=402\) columns. Note that the DTMs are high-dimensional in the column dimension, even if not in the row dimension. In both the knitwear and the dresses datasets, each document is composed, in median, of only three words, and each word appears in less than 1% of documents. As a whole, the DTMs are highly sparse, with about 99% of empty cells. Some descriptive statistics are shown in Table 2. The response variable, the price, displays a high standard deviation as compared to its average level and shows an asymmetric distribution. In order to assess the robustness of the results, the analysis has been carried out both on the prices and on the log prices. As the results, in terms of average price variability explained, do not significantly differ between the log-transformed and the original prices, and as interpretation of the results is at the core of the hedonic evaluation, we choose to present the results of the analysis on the prices on their natural scale. It is true that the overall variability is smaller when the logarithmic transform of the data is taken. On the other hand, relying on the original prices allows us to express a priori judgements on which variables may be relevant and influential.

Table 2 Descriptive statistics for case study 1

3.2 Case study 2

The second case study is related to a section of the Tech Company Employee Reviews dataset, downloaded from www.kaggle.com. The data were scraped from www.glassdoor.com. Glassdoor is a website which allows current and former employees to anonymously review companies, to anonymously submit and view salaries, and to search and apply for jobs on the same platform. The analyses are carried out for the reviews about one anonymised worldwide tech company. Each record collects information on company, location, date of the review, job and position. Moreover, it collects evaluations on a 1 to 5 scale of the overall rating and of other aspects concerning the job, along with two further open questions on pros and cons. Words deriving from the positive field, pros, are presented preceded by the prefix p, while words deriving from the negative field, cons, are preceded by the prefix c. Our aim is to explain the overall rating using only the text contained in the pros and cons open question fields as data. A sample from the dataset is reported in Table 3.

Table 3 Sample from the dataset of case study 2

We shall treat ratings as the response variable in a text regression. Though more specific methods can be applied to interval scales and ordinal variables (see, for instance, Hastie et al. 2009, ch. 14), for the sake of interpretation and comparison, and in order to effectively perform variable selection, we shall investigate the marginal contribution of attributes by estimating linear regression models. The motivation for a regression analysis on ratings is twofold. First, variable selection methods for categorical data are not as developed as methods for continuous variables and thus often do not allow for homogeneous comparisons. Second, ratings are mostly analysed as quantitative variables, which makes the results of the analysis in the paper useful in several related applications.

After a text pre-processing similar to the one described in section 3.1.1, the DTM for the Employees dataset is an \(N=808\) by \(P=1135\) matrix. In this case, we also considered unordered pairs of words with sparsity lower than 99%. In fact, no pair resulted among the most frequently selected variables; see Table 4. Each document is composed, in median, of 15 words. Each word, on average, is present in 1.7% of documents, and the sparsity of the DTM reaches 98.3%.

Table 4 Descriptive statistics for case study 2

4 Design of the study

4.1 Bootstrap

Evaluating the extent to which relevant variables have been detected is a challenging task, as the true model is unknown except in simulations. In practical applications, only one dataset at a time is given, and one does not know which variables are truly influential. To mimic the availability of several datasets, we resort to bootstrap replications and resample (five hundred times) from each single dataset. Indeed, the bootstrap may yield desirable perturbations similar to those of multiple datasets (Efron and Tibshirani 1998). The analyses have been carried out for each case study.

The bootstrap is performed using the classical approach of resampling with replacement. Each replication of the original dataset is divided into a training and a hold-out dataset. We expect that duplicated observations may somehow overestimate the performance evaluation of the specifications at hand, but in a manner that we expect to be uniform across methods. Indeed, we have verified that the ordering of performance across specifications does not change if simple cross-validation without repetition is used, which, in addition, reduces the size of the training and hold-out datasets. For each bootstrap dataset, we first select variables over the training dataset, using the pool of models described in section 4.3. Then, a linear regression model with the selected variables as predictors is estimated over the hold-out dataset by the method of ordinary least squares.
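The scheme can be summarised by the following sketch, where y and X denote the response vector and the DTM of Sect. 3, and select_variables is a hypothetical placeholder for any of the selection methods of section 4.3.

```r
set.seed(1)
M  <- 500                          # bootstrap replications
N  <- nrow(X)
N1 <- floor(2/3 * N)               # training size, cf. section 4.4

results <- vector("list", M)
for (m in seq_len(M)) {
  # Classical bootstrap: resample with replacement, then split the
  # replication into training and hold-out parts.
  idx   <- sample(N, N, replace = TRUE)
  train <- idx[seq_len(N1)]
  hold  <- idx[(N1 + 1):N]

  # Step 1: variable selection on the training part (placeholder).
  sel <- select_variables(X[train, ], y[train])

  # Step 2: OLS refit of the selected variables on the hold-out part.
  fit <- lm(y[hold] ~ X[hold, sel, drop = FALSE])

  results[[m]] <- list(selected = sel, r2 = summary(fit)$r.squared)
}
```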

4.2 Criteria

We assess the relative performance of alternative selection models on the basis of the following indicators: the predictive \(R^2\), the inclusion frequency and the model class reliance.

To compare the models in terms of complexity, i.e. the number of variables selected, we consider the predictive \(R^2\). We remark here that we compute the predictive \(R^2\) not because we aim to evaluate the usefulness of selected variables for out-of-sample forecasting, which is beyond our scope, but rather to evaluate how relevant the selected words are in explaining the response variable when new, potentially similar, datasets are considered. As a matter of fact, the datasets we analyse are renewed either at every season (in case study 1, the e-commerce data) or at every further session of human resources evaluation (in case study 2, the employees’ ratings data), and the goal of our study is to identify which methods allow us to retrieve the relevant drivers of either item prices or employee satisfaction in any further dataset of this type.
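For concreteness, and using the notation introduced in section 4.4, the predictive \(R^2\) can be written in its standard out-of-sample form (our reading of the description above, as the paper does not display a formula):

$$\begin{aligned} R^2_{\text {pred}} = 1 - \frac{\sum _{i=1}^{N_2}\left( Y_{2i} - X_{2(i)}{\hat{\beta }}^*\right) ^2}{\sum _{i=1}^{N_2}\left( Y_{2i} - {\bar{Y}}_2\right) ^2} \end{aligned}$$

where \({\hat{\beta }}^*\) is the post-selection OLS estimate and \({\bar{Y}}_2\) is the mean of the response over the hold-out set.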

The overall quality of each model is assessed through the inclusion frequency and the model class reliance, which are essentially indicators of variable importance, both computed at the variable level and averaged at the specification level. The inclusion frequency measures how often variables are selected over the bootstrap replications. Variables with a high inclusion frequency have been selected in several bootstrap samples, implying that the model is robust to perturbations of the dataset.

The model class reliance (Fisher et al. 2018) measures the extent to which a well-performing model within a pre-specified class may rely on a variable of interest for its prediction accuracy. Within the class, model reliance is a core measure of variable importance, in that it tells how much an individual prediction model relies on the explanatory variables of interest for its accuracy. For each model m, the model reliance of each variable j, denoted as MR\(_{j,m}\), is computed as the ratio between the loss function evaluated on the DTM with the jth variable permuted (numerator) and the same loss function evaluated on the original DTM (denominator). By permuting the elements of the jth variable, it is possible to assess the increase in the loss when the variable itself is rendered uninformative; see section 3 of Fisher et al. (2018) for further details. As loss function, we consider the residual standard error. At each bootstrap replication, the jth variable in the training set is permuted and the model reliance is computed. We then obtain the empirical bootstrap model reliance of each jth variable as the average over the bootstrap replications. Finally, we obtain the highest model class reliance (MCR), that is, the upper extreme of the interval which defines the MCR. Note that the MCR is a measure of variable importance in a given dataset.
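A minimal sketch of the permutation step for a single variable, assuming a linear model fit estimated on the selected columns of the training DTM; the residual standard error is used as loss, as stated above.

```r
# Model reliance of variable j (Fisher et al. 2018): ratio between the
# residual standard error computed after permuting column j and the one
# computed on the intact matrix. Values above 1 indicate that the model
# relies on variable j for its accuracy.
model_reliance <- function(fit, X_train, y_train, j) {
  rse <- function(X) {
    pred <- cbind(1, X) %*% coef(fit)           # intercept + slopes
    sqrt(sum((y_train - pred)^2) / (nrow(X) - ncol(X) - 1))
  }
  X_perm      <- X_train
  X_perm[, j] <- sample(X_perm[, j])            # render j uninformative
  rse(X_perm) / rse(X_train)
}
```

Averaging model_reliance over the bootstrap replications gives the empirical bootstrap model reliance of variable j described above.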

In summary, we evaluate the performance of alternative variable selection methods on the basis of: (1) their explanatory power out of the training sample, measured in terms of predictive \(R^2\); (2) the bootstrap inclusion frequency of selected variables; and (3) the ability to select important variables for the specific dataset.

4.3 Models

Several variants of the lasso are considered, which we shall discuss in a few more details in section 4.4. Nevertheless, we prefer to introduce them all here to provide a full overview of the methods and models used in the analysis. First, to tune the lasso parameter, we resort to standard criteria used to optimise the predictive performance. We call lasso-min the model attained by minimising the cross-validation error, which is recognised to optimise predictive performance. Secondly, we optimise the tuning parameter by resorting to BIC variants: lasso-bic is attained by minimising the BIC, while lasso-ebic05 and lasso-ebic10 are attained by minimising the extended BIC, with moderate and high model complexity penalties regulated by the tuning parameter \(\gamma = 0.5\) and \(\gamma = 1.0\), respectively. To evaluate the performance of randomisation, we choose the stability selection method in the variant proposed by Shah and Samworth (2013), imposing different thresholds on the selection probability. The results are very similar when changing the selection probability, and here we present those attained with the threshold 0.7, named ssmb. As a screening method, we run the SIS algorithm and consider the performance produced by different numbers of selected predictors, from 1 to 25, labelled from sis-k1 to sis-k25. Finally, we evaluate the performance of the simple lasso obtained by imposing different numbers of selected predictors, from 1 to 25, labelled from lasso-k1 to lasso-k25.

The results obtained with lasso-based methods are compared with those attained through the standard LDA analysis, used here as a benchmark and, as such, applied as follows. Over the randomised training sets, we select solutions corresponding to an a priori fixed number of topics, from 1 to 25, labelled from lda-k1 to lda-k25, with hyperparameters set to default values, as discussed in Sect. 4.5.

Lasso-based specifications with an average number of selected variables (over bootstrap replications) are compared to LDA specifications with the same number of components.

4.4 Methodological and computational aspects

We specify the linear regression model

$$\begin{aligned} Y = X\beta +\varepsilon \end{aligned}$$

where \(Y\in {\mathbb {R}}^N\) denotes the response variable, \(X\in {\mathbb {R}}^{N\times P}\) is the DTM matrix, \(\beta \in {\mathbb {R}}^P\) is the vector of regression coefficients and \(\varepsilon \in {\mathbb {R}}^N\) is an independent and identically distributed error term.

At each bootstrap replication, the DTM is partitioned accordingly into \(X_1\in {\mathbb {R}}^{N_1\times P}\) and \(X_2\in {\mathbb {R}}^{N_2\times P}\), where \(N_1\) and \(N_2 = N-N_1\) denote the number of observations in the training set and in the test set, respectively. In our applications, we have fixed \(N_1 = (2/3)N\). Note that we have M bootstrap replications of the pair \((Y_h,X_h)\), \(h=1,2\), i.e. \((Y_h^{(m)},X_h^{(m)})\), \(m = 1,\dots ,M\), with \(M=500\), but for ease of notation we drop the superscripts.

In the training set, lasso (Tibshirani 1996) solves the penalised least squares problem

$$\begin{aligned} {{\hat{\beta }}}_\lambda ^{{\tiny {lasso}}} = \arg \min _{\beta \in {\mathbb {R}}^P} \left\{ \parallel Y_1-X_1\beta \parallel ^2 + \lambda \sum _{j=1}^P|\beta _j|\right\} \end{aligned}$$
(1)

where \(\parallel \cdot \parallel\) denotes the Euclidean norm and \(\lambda >0\) is a tuning parameter which shrinks some coefficients to zero and, consequently, makes the corresponding variables irrelevant.

In our applications, the tuning parameter is selected by several methods. One criterion consists in minimising the K-fold cross-validation error (Stone 1974), with \(K=10\),

$$\begin{aligned} \text {CV}(\lambda ) = \frac{1}{K}\sum _{k=1}^K\sum _{i:\,\kappa (i)=k}(Y_{1i}-X_{1(i)} {\hat{\beta }}_{\lambda , -k})^2 \end{aligned}$$

where \(Y_{1i}\) is the generic element of \(Y_1\in {\mathbb {R}}^{N_1}\), \(X_{1(i)}\in {\mathbb {R}}^{P}\) denotes the ith row of \(X_1\), the map \(\kappa :\{1,\dots ,N_1\}\rightarrow \{1,\dots ,K\}\) indicates the fold to which each observation is allocated by randomisation, and \({\hat{\beta }}_{\lambda ,-k}\) is the lasso estimate of \(\beta\) obtained without the contribution of the observations in the kth fold.
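In R, the lasso-min specification corresponds to the standard glmnet workflow; a sketch, with X1 and Y1 the training data:

```r
library(glmnet)

# lasso-min: tune lambda by 10-fold cross-validation.
cvfit      <- cv.glmnet(X1, Y1, alpha = 1, nfolds = 10)
lambda_min <- cvfit$lambda.min

# Selected variables: indices of nonzero coefficients (intercept excluded).
b   <- coef(cvfit, s = "lambda.min")
sel <- which(b[-1] != 0)
```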

The Bayesian information criterion (BIC) by Schwarz (1978) is based on minimisation of the following objective function,

$$\begin{aligned} \text {BIC}(\lambda )= N_1\log {\hat{\sigma }}^2_\lambda + df(\lambda )\log (N_1) \end{aligned}$$

where \({{\hat{\sigma }}}^2_\lambda = \frac{1}{N_1}\sum _{i=1}^{N_1}(Y_{1i}-X_{1(i)}{\hat{\beta }}_\lambda )^2\) and \(df(\lambda )\) is the effective degrees of freedom, for which an unbiased and consistent estimator is the number of nonzero coefficients (Zou et al. 2007).

The extended BIC (eBIC) by Chen and Chen (2008) adds an extra penalty, governed by a parameter \(\gamma \in [0,1]\), that accounts for the model complexity, summarised by the term \(\tau _j= \binom{P}{j}\), where j is the number of covariates considered in the model,

$$\begin{aligned} \text {eBIC}_\gamma (\lambda ) = N_1\log {\hat{\sigma }}^2_\lambda + df(\lambda )\log (N_1) + 2\gamma df(\lambda ) \log ( \tau _j). \end{aligned}$$
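Both criteria can be evaluated along a single lasso path; a sketch, where the eBIC complexity term follows the form \(2\gamma \log \binom{P}{df}\) of Chen and Chen (2008) (note that the display above also carries a \(df(\lambda )\) factor, so the exact penalty used in the paper should be treated as an assumption here):

```r
fit <- glmnet(X1, Y1, alpha = 1)        # whole lasso path
P   <- ncol(X1); N1 <- nrow(X1)

rss <- colSums((Y1 - predict(fit, newx = X1))^2)   # RSS per lambda
df  <- fit$df                                      # nonzero coefficients

bic  <- N1 * log(rss / N1) + df * log(N1)
ebic <- function(gamma) bic + 2 * gamma * lchoose(P, df)

lambda_bic    <- fit$lambda[which.min(bic)]        # lasso-bic
lambda_ebic05 <- fit$lambda[which.min(ebic(0.5))]  # lasso-ebic05
lambda_ebic10 <- fit$lambda[which.min(ebic(1.0))]  # lasso-ebic10
```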

Stability selection based on lasso is discussed in detail in Meinshausen and Bühlmann (2010), section 2.2. The key concept is the stability path, given by the probability of each variable being selected when randomly resampling from the data, over all the values of the regularisation parameter. Specifically, lasso provides estimates of the set of nonzero coefficients as \({\hat{S}}_\lambda = \{j:{\hat{\beta }}_{\lambda ,j}\ne 0\}\), where \({\hat{\beta }}_{\lambda ,j}\) is an element of \({\hat{\beta }}_{\lambda }\) in equation (1). Let I be a random subsample of \(\{1, \dots , N\}\) drawn without replacement. For every set \(J \subseteq \{1, \dots , P\}\), the probability of being contained in the selected set \({\hat{S}}_\lambda (I)\) is \(\pi _J^\lambda = {\mathbb {P}} \{ J \subseteq {\hat{S}}_\lambda (I)\}\). For every variable \(j=1,\dots ,P\), the stability path is given by the selection probabilities \(\pi _j^\lambda\) across \(\lambda\). For a cut-off \(\pi _0\in (0,1)\) and a set of regularisation parameters \(\Lambda\), the set of stable variables is defined as \({\hat{S}}^{\text {stable}} =\{j:\max _{\lambda \in \Lambda }{\hat{\pi }}_j^\lambda \ge \pi _0\}\). Here we apply the complementary pairs version of stability selection by Shah and Samworth (2013), which improves error control.
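The ssmb specification can be reproduced, up to tuning details, with the stabs package; PFER, the bound on the expected number of falsely selected variables, is not reported in the paper and is an assumption here.

```r
library(stabs)

# Complementary pairs stability selection (Shah and Samworth 2013) with
# lasso as base learner; cutoff = 0.7 matches the threshold of Sect. 4.3.
ss <- stabsel(x = X1, y = Y1, fitfun = glmnet.lasso,
              cutoff = 0.7, PFER = 1, sampling.type = "SS")
sel <- ss$selected    # indices of the stable variables
```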

As a further criterion, we select the smallest model with at least k nonzero coefficients, i.e. we take \({\hat{\lambda }}_k = \max \{\lambda \in \Lambda : \text {card}({\hat{S}}_\lambda ) \ge k\}\), the largest value of \(\lambda\) for which at least k coefficients are nonzero, \(k=1,\dots ,25\).
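Continuing from the glmnet path object fit above, this criterion amounts to one line; k = 6 is an illustrative value matching the knitwear comparison.

```r
# lasso-k: largest lambda with at least k nonzero coefficients.
k        <- 6
lambda_k <- max(fit$lambda[fit$df >= k])
sel_k    <- which(coef(fit, s = lambda_k)[-1] != 0)
```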

The sure independence screening (Fan and Lv 2008) is based on correlation learning, which filters out the variables that have weak correlation with the response. Let \(S_* = \{j:\beta _{j}\ne 0\}\) denote the true model. SIS selects \(S_\xi = \{j:|\omega _j| \text { is among the first } [\xi N] \text { largest}\}\), where \([\xi N]\) denotes the integer part of \(\xi N\), \(\xi \in (0,1)\) and \(\omega = X'Y\). SIS enjoys the sure screening property, i.e. \({\mathbb {P}}(S_*\subset S_\xi )\rightarrow 1\) as \(N\rightarrow \infty\).
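The sis-k specifications can be obtained with the SIS package; how exactly the paper maps k to the screening size is not stated, so nsis = 6 below is an illustrative assumption.

```r
library(SIS)

# Sure independence screening followed by a lasso regularisation step.
sis_fit <- SIS(x = X1, y = Y1, family = "gaussian",
               penalty = "lasso", tune = "bic", nsis = 6, iter = FALSE)
sel <- sis_fit$ix    # indices of the finally selected variables
```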

Once the relevant variables have been selected, they form the new DTM \(X_2\in {\mathbb {R}}^{N_2\times P^*}\) over the test set, where \(P^*\) denotes the number of relevant variables eventually selected by each procedure, and the regression coefficients are estimated by ordinary least squares as the solution of the system

$$\begin{aligned} X_2'X_2 {{\hat{\beta }}}^* = X_2'Y_2. \end{aligned}$$

For the case of the lasso, Belloni and Chernozhukov (2013) discuss some additional assumptions to show that the post-estimation OLS, also referred to as post-lasso, performs at least as well as the lasso itself.
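A sketch of the refit, with sel the indices selected on the training set and (Y2, X2) the hold-out data:

```r
# Post-lasso: OLS on the hold-out set, restricted to selected variables.
X2_sel <- X2[, sel, drop = FALSE]
post   <- lm(Y2 ~ X2_sel)
r2     <- summary(post)$r.squared   # predictive R^2 of Sect. 4.2
```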

4.5 Comparison with LDA

The results obtained by text regression are compared with those obtained by the unsupervised generative model LDA. We exploit neither the many variants of LDA for short texts nor the supervised LDA, as LDA is presented here only as a general benchmark. LDA assumes that each document in the corpus can be described as a probabilistic mixture of T topics and, as an output, LDA provides the probability of document d belonging to topic t, \({\mathbb {P}}(t|d)\), where \(d = 1,\dots ,D\) indexes the documents and \(t = 1,\dots ,T\) indexes the topics. In turn, each topic is defined by a probability distribution over the vocabulary of size P; for each topic, the word probability vector \({\mathbb {P}}(v|t)\), where \(v=1,\dots ,P\) indexes the words, describes how likely it is to observe a word conditional on a topic. LDA proceeds through posterior inference of the latent topics given the observed words and, as a conjugate prior to the multinomial distribution, LDA uses Dirichlet priors.

In this analysis, the Dirichlet prior hyperparameters are set to default values (\(\alpha = 0.1, \beta = 0.05\)) and the model is estimated using collapsed Gibbs sampling, as described in Jones (2019).

As a next step, in each comparison of LDA to lasso-based specifications, the number of LDA topics, T, is fixed equal to the average number of selected variables (over bootstrap replications), \(P^*\). We generate indicator variables taking value one for each document where the topic probability, \({\mathbb {P}}(t|d)\), is larger than the average topic probability, \(\frac{1}{T}\sum _{t=1}^{T} {\mathbb {P}}(t|d)\), slightly modifying the procedure by Schwarz (2018), which is based on the largest topic probability. The number of indicator variables equals the number of topics, T, which, in turn, in each comparison, equals the average number of selected variables, \(T=P^*\). In that case, \(X_2\) is replaced by \(X_2^*\in {\mathbb {R}}^{N_2\times P^*}\), which collects the \(P^*\) indicator variables. Finally, the response variable (either ratings or prices) is regressed over the indicator variables derived from the topics and collected in \(X_2^*\), and the performance is evaluated as for the lasso-based methods.
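A sketch of this construction with textmineR, whose defaults match the hyperparameters of Sect. 4.5; dtm_train, dtm_test, the number of topics k = 6 and the Gibbs iteration counts are illustrative assumptions.

```r
library(textmineR)

# Fit T topics on the training DTM by collapsed Gibbs sampling.
lda <- FitLdaModel(dtm = dtm_train, k = 6, iterations = 500,
                   alpha = 0.1, beta = 0.05)

# theta: documents x topics matrix of P(t|d) for the test documents.
theta <- predict(lda, dtm_test, method = "gibbs", iterations = 200)

# Indicator = 1 where P(t|d) exceeds the document's average topic
# probability (each row of theta sums to one, so the average is 1/T).
Z <- (theta > rowMeans(theta)) * 1L

# Regress the response on the topic indicators, as for the lasso methods.
fit_lda <- lm(Y2 ~ Z)
```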

As far as the comparison of selected words is concerned, the most frequently selected words over bootstrap replications are considered for lasso-based specifications. As regards lda-k, the top \(P^*\) terms of each topic in each bootstrap replication are considered, after discarding duplicates. Indeed, the same word may appear within the top \(P^*\) terms of more than one component.

On the one hand, we expect that, in terms of predictive \(R^{2}\), the comparison a priori favours LDA specifications, as k components in LDA embed more information than k words alone. On the other hand, the comparison of selected words extracted by shrinkage methods to top terms of LDA components, which certainly sounds more artificial, may nevertheless provide useful information on the relative ability to detect relevant words.

4.6 Computational details

All the computations are carried out in the R software. In particular, lasso is implemented in the glmnet package by Friedman et al. (2010). Stability selection is run using the stabs package by Hofner and Hothorn (2017) and Hofner et al. (2015). Sure independence screening is carried out through the package SIS (Saldana and Feng 2018). The LDA topic model is fitted using textmineR (Jones 2019).

5 Results

5.1 Case study 1: prices

5.1.1 Predictive \(R^2\)

The main results of case study 1, in terms of number of selected variables and predictive \(R^2\), are displayed in Table 5 and summarised in Fig. 1.

Table 5 Prices. Number of selected variables and predictive \(R^2\)

As expected, lasso-min always selects, on average, a very high number of predictors in both categories, reaching the highest adjusted and predictive \(R^2\) and explaining 0.773 of the price variation for knitwear and 0.694 for dresses over the replicated datasets. Indeed, as observed above, it tends to be generous in selecting noisy variables. The number of predictors selected by lasso-min also displays the highest standard deviation.

Fig. 1 Prices. Number of selected variables and predictive \(R^2\). Averages over bootstrap replications

When lasso is optimised by minimising the eBIC, it employs few parameters; the higher the value of the tuning parameter, the smaller the number of selected variables, on average. The most parsimonious eBIC variant selects, on average, 14.2 predictors for knitwear and 19.4 predictors for dresses, explaining about 61% of the price variability in new datasets. Stability selection is quite parsimonious as well: it selects 6.3 predictors for knitwear and 9.0 for dresses, explaining shares of 0.499 and 0.404 of the predictive variance of prices within the two categories. These predictive \(R^2\) values are only slightly lower than those of lasso-ebic10, though based on considerably fewer predictors. Moreover, it has to be observed that lasso models based on eBIC optimisation tend to select a limited number of coefficients on average, but with a certain degree of heterogeneity over the replications.

For the sake of brevity, having results for \(k=1,2,\dots ,25\), Table 5 presents results for SIS and lasso computed with the same (average) fixed number of predictors as ssmb and lasso-ebic10, i.e. \(\bar{P^*} =6\) (knitwear) and \(\bar{P^*}=9\) (dresses) for stability selection, and \(\bar{P^*}=14\) (knitwear) and \(\bar{P^*}=19\) (dresses) for lasso optimised with eBIC with \(\gamma =1\).

Focusing on the SIS method, we note that it produces results very similar to ssmb and lasso-ebic10 when a comparable number of predictors is imposed. The same pattern may be observed when lasso is used with a fixed number of selected variables. Figure 1 sketches an overview of the presented methods for the two categories. Note that, as expected, the performance in terms of predictive \(R^2\) increases, with diminishing derivative, as the number of predictors grows.

Fig. 2 Prices. Number of coefficients. Averages over bootstrap replications

Figure 2 displays the standard deviation versus the average number of coefficients for a selection of models, highlighting that ssmb consistently ensures parsimonious models.

A first finding is that the two methods producing the most parsimonious variable selection are lasso-ebic10 and ssmb, with the latter seeming preferable in terms of robustness. A similar performance is reached by the simple lasso or SIS with the same, fixed, number of predictors.

5.1.2 Inclusion frequency and model class reliance

We now come to the analysis of inclusion frequencies and model reliance. The analysis covers a selection of models: lasso optimised with eBIC with \(\gamma =1\); stability selection; SIS and lasso computed with the same (average) fixed number of predictors as ssmb and lasso-ebic10; and LDA with the number of topics equal to the same (average) fixed number of predictors as ssmb and lasso-ebic10. Tables 6 and 7 display the most frequently selected variables picked by lasso-ebic10, ssmb and by SIS or lasso with a priori fixed coefficients over the knitwear and dresses datasets, respectively. For each specification, the tables present the average inclusion frequency and the highest model class reliance (MCR), by word and on average.

The results display levels of mean inclusion frequency that are comparable across the variable selection methods, with some heterogeneity. A similar picture is found in terms of the highest MCR, i.e. comparable levels across lasso-based methods, always higher than the LDA results.

In the knitwear dataset (Table 6), the MCR shows that, when 6 variables are selected, the gain in the residual standard error amounts to 7.1 percentage points on average, and to 3.5 to 3.8 percentage points when 14 variables are selected.

Table 6 Selected words and inclusion frequency for the Knitwear dataset
Table 7 Selected words and inclusion frequency for the Dresses dataset

Table 7 displays selected words and bootstrap inclusion frequencies over the dresses dataset. Text regression methods provide similar levels of average inclusion frequencies. The highest MCR values confirm that lasso-based variable selection methods, more often than LDA, identify words indicating important variables. When 9 words are selected, the MCR indicates a gain of about 5 percentage points on average; when 19 words are selected, of about 3.5 percentage points on average.

Finally, the analysis of the estimated coefficients and of their coefficients of variation (see Fig. 3) clearly shows that a few of them stand out of the bulk of the data, a negative note in terms of model robustness for sis, which has selected features whose coefficients exhibit disproportionate variability.

5.1.3 Summary

We may conclude that the results attained through text regression methods always significantly outperform lda in terms of predictive \(R^2\). Mean inclusion frequencies of lda models, on the other hand, are comparable to those attained by the other methods. At the same time, the ability to select important variables favours text regressions. Overall, for the prices datasets, our findings recommend the use of text regression methods. Among the latter, the best performance in terms of the three measures considered throughout the analysis is provided by ssmb, i.e. stability selection, which is both parsimonious and stable, in the sense that it is not affected by high variability in the estimated coefficients.

Fig. 3 Prices. CV of estimates of coefficients

5.2 Case study 2: ratings

5.2.1 Predictive \(R^2\)

Table 8 displays the results related to the second case study, conducted with the aim of explaining ratings with tokens drawn from open questions. In this example, lasso-bic tends to select the largest number of variables, capable of explaining around 78.4% of the predictive ratings variation. As in the previous example, lasso-min produces, on average, a high number of predictors, reaching the highest predictive \(R^2\). Figure 4 (left panel) confirms that high predictive power may be reached only at the cost of selecting a very large number of predictors. Note that lasso-min and lasso-bic are not included in the figure, as we have maintained the same scale as Fig. 1 for the sake of comparison. More parsimonious models are selected by ssmb and lasso-ebic10, based on nine variables or fewer, even if lasso optimising the eBIC still displays higher variability in the number of coefficients (see Fig. 4, right panel, on the same scale as Fig. 2). Both the lasso model which minimises the eBIC and stability selection guarantee an acceptable trade-off between explanatory power and parsimony. As in case study 1, lasso and SIS perform comparably well to ssmb and lasso-ebic10 when the number of predictors is fixed a priori, with an out-of-sample \(R^2\) oscillating around 30 to 35%.

Differently from the case of prices, with ratings our findings are slightly in favour of topic modelling, as the lda predictive \(R^2\) values, corresponding, respectively, to solutions with 7 and 9 topics to preserve comparability, are slightly higher than those reached through text regression methods.

Table 8 Ratings. Number of selected variables and predictive \(R^2\)
Fig. 4 Ratings. Number of selected variables and predictive \(R^2\) (left). Number of coefficients (right). Averages over bootstrap replications

5.2.2 Inclusion frequency and model class reliance

Table 9 compares the most frequently selected features over the five hundred bootstrap replications by the methods, as well as the highest MCR of the selected variables. Words have been drawn from the two open questions asking to indicate, respectively, pros and cons; recall from Sect. 3.2 that words from the pros field are prefixed by p and words from the cons field by c. The more parsimonious the models, the higher the inclusion frequency of the selected variables. For both ssmb and lasso-ebic10, the most relevant variables are selected in more than 65% of replications. It is evident that, in terms of both persistence and relevant variables, ssmb and lasso-ebic10 outperform all the other methods. As far as lda is concerned, its top selected words display slightly lower mean inclusion frequency rates than the other methods considered.

Table 9 Selected words and inclusion frequency for the ratings dataset

Concerning the ability to detect important variables, note that, in all cases, the highest MCR values only negligibly exceed unity, meaning that the selected variables, on average, are able to decrease the loss in predictive power by only about 1 percentage point.

5.2.3 Summary

The results discussed above are corroborated by Fig. 5, confirming that, except for the highest coefficient of variation of lasso-ebic10, all the methods behave in a much more homogeneous way than in the previous case study, in terms of inclusion frequency and predictive \(R^2\).

Fig. 5 Ratings. CV of estimates of coefficients

All in all, in this second case study, among text regression methods, stability selection is the preferred method, as it ensures an acceptable trade-off between explanatory power and parsimony. In addition, it outperforms the other methods in terms of persistence and relevant variables. However, differently from the case of prices, with ratings our findings are slightly in favour of topic modelling. The reasons can be related to the fact that in this example (a) sentences are not as short as in the case of prices and (b) the words have an emotional content, in this case stronger than in the case of attributes (e.g. freedom as opposed to wool).

On this point, and with the focus on interpretability, it is worth remarking that selected words have to be read jointly with their most co-occurring words. The task of understanding the meaning to which each word refers is usually performed in lda by looking at the words with the highest probability of occurring within each topic. To follow a similar path in text regression, we consider the tokens that show the greatest co-occurrence, in a sort of topic reconstruction, displayed in Table 10. In this way, each selected word is accompanied by some further ones, among which it plays the pivotal role. As with standard topic models, such as lda, each group of tokens has to be interpreted: researchers familiar with the field of study, jointly looking at the top tokens, find the best meaning for the topic.

In summary, in this case study, our results do not strongly support the use of text regression methods, but rather favour lda. The latter performs slightly better than text regression methods in explaining ratings data. We argue that this can be related to the length of the text data, longer in this second case study than in the first one, and to the very nature of the case study itself, where the joint co-occurrence of words within the pros and cons fields, rather than single words alone, should explain the topic, as motivations underlying the ratings.

Table 10 Reconstructed topics (co-occurrence greater than 20%) from pivotal words

6 Concluding remarks

The paper has investigated the analytics that allow one to exploit the informative content and the explanatory power of unstructured, short texts on a response variable.

Interpretability of the results was a key issue for the present study; hence, we have restricted our focus to shrinkage methods and, within this class, we have favoured models that provide results of easier interpretation.

In this perspective, we have compared the explanatory power of variables selected through several variants of the lasso, and through screening-based and randomisation-based methods, namely sure independence screening and stability selection. A subsequent comparison has also been run with the widely applied topic model, i.e. LDA, used as a benchmark. The relative performance of the methods has been assessed based on the number and the importance of the selected variables.

We have considered two applications. The first application focused on explaining prices of goods within a product category, based on the captions provided by manufacturers on e-commerce platforms. In this case study, the nature of the texts is descriptive, as they characterise the goods for sale; after the text pre-processing phase, texts are very short, reduced in the median to only three words. The second application aimed to understand how to use open questions to obtain information on overall satisfaction within surveys. After the text pre-processing phase, texts are short, with 15 words in the median, but longer than in the previous case. Furthermore, here texts express opinions, and in particular satisfaction or dissatisfaction; thus, compared to the first case study, they are much more related to the emotional sphere.

The results of the study provide insights in two main directions, concerning, on the one hand, the performance of the analysed models within the class of text regression and, on the other hand, the different ability of text regression versus topic modelling methods in extracting information from short texts.

Along the first direction, our findings show that, in terms of explanatory power, both stability selection and the lasso which optimises the eBIC criterion are able to improve on the lasso when the latter is optimised through standard criteria aimed at prediction. Nevertheless, by limiting the number of selected variables, both lasso and sure independence screening are capable of attaining comparable results. As far as the ability to select relevant explanatory predictors is concerned, stability selection slightly outperformed the other methods, which nonetheless exhibited good performance. In our opinion, a relevant finding is that lasso behaves as well as alternative, computationally more intensive methods when the number of selected variables is limited.

Concerning the comparison of text regression with LDA, the former outperforms LDA in terms of explanatory power in the prices case study, while LDA outperforms text regressions in the ratings case study. This is likely to happen both because texts are longer in the ratings case study and because of their contents, which are naturally more connected to latent topics.

In terms of quality of the selected words, text regression outperforms LDA in the case of prices but not entirely in the case of ratings. However, words selected based on text regressions are always more robust than LDA ones, so that, in both cases, text regressions appear highly suitable for picking up relevant words within a bag of words.

To conclude, we remark that the results of the paper describe how text regression and variable selection methods work over two specific applications and cannot be broadly generalised without further extensive analyses. However, our findings favour variable selection in text regressions as a method that may provide valuable solutions when texts are short, and they open the way to further investigations.