
An overview and empirical comparison of natural language processing (NLP) models and an introduction to and empirical application of autoencoder models in marketing

  • Methodological Paper
Journal of the Academy of Marketing Science


With artificial intelligence permeating conversations and marketing interactions through digital technologies and media, machine learning models, in particular, natural language processing (NLP) models, have surged in popularity for analyzing unstructured data in marketing. Yet, we do not fully understand which NLP models are appropriate for which marketing applications and what insights can best be derived from them. We review different NLP models and their applications in marketing. We lay out the advantages and disadvantages of these models and highlight the conditions under which different models are appropriate in the marketing context. We introduce the latest neural autoencoder NLP models, demonstrate these models on new product announcements and news articles, and provide an empirical comparison of the different autoencoder models along with the statistical NLP models. We discuss the insights from the comparison and offer guidelines for researchers. We outline future extensions of NLP models in marketing.




  1. For expositional ease, we use the terms model and algorithm interchangeably throughout the paper.

  2. We exclude sentiment analysis, whose main purpose is to identify polarity and measure sentiment (affect, emotion) in a text, typically as a positive, negative, or neutral disposition, using lexicons such as LIWC, SentiWordNet, and the General Inquirer as references.

  3. The softmax function is a normalized exponential function, that is, a logistic function generalized to multiple dimensions.

  4. The softplus function smoothly approximates the ReLU (Rectified Linear Unit) activation function, a piecewise linear function that outputs the input when the input is positive and zero otherwise.

  5. We recognize that other autoencoders, such as the sparse autoencoder, denoising autoencoder, and contractive autoencoder, may be useful in other contexts such as image processing (e.g., Ng, 2011).
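The softmax, softplus, and ReLU functions described in footnotes 3 and 4 can be sketched in plain Python (the function names are ours, for illustration only):

```python
import math

def softmax(x):
    """Normalized exponential: maps a vector to a probability distribution."""
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softplus(x):
    """Smooth approximation of ReLU: log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def relu(x):
    """Piecewise-linear rectifier: the input if positive, zero otherwise."""
    return max(0.0, x)
```

The softmax outputs sum to one, and softplus(x) tracks relu(x) closely for large positive x while remaining differentiable at zero.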


  • Agrawal, D., & Schorling, C. (1996). Market share forecasting: An empirical comparison of artificial neural networks and multinomial logit model. Journal of Retailing, 72(3), 383–407.


  • Aletras, N., & Stevenson, M. (2013). Evaluating topic coherence using distributional semantics. In Proceedings of the 10th international conference on computational semantics (IWCS'13) long papers (pp. 13–22).


  • Altszyler, E., Sigman, M., & Slezak, D. F. (2017). Corpus specificity in LSA and Word2vec: The role of out-of-domain documents. arXiv:1712.10054v1.

  • Archak, N., Ghose, A., & Ipeirotis, P. G. (2011). Deriving the pricing power of product features by mining consumer reviews. Management Science, 57(8), 1485–1509.


  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

  • Balducci, B., & Marinova, D. (2018). Unstructured data in marketing. Journal of the Academy of Marketing Science, 46(4), 557–590.


  • Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.


  • Berger, J., Humphreys, A., Ludwig, S., Moe, W. W., Netzer, O., & Schweidel, D. A. (2020). Uniting the tribes: Using text for marketing insight. Journal of Marketing, 84(1), 1–25.


  • Bischoff, J. M., & Airoldi, E. M. (2012). Summarizing topical content with word frequency and exclusivity. Proceedings of the 29th international conference on machine learning, June, 9–16.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.


  • Boulton, C. (2019). Introducing GAIL: Great Wolf Lodge’s AI for pinpointing guest sentiment. CIO.

  • Bowman, S.R., Potts, C., & Manning, C.D. (2014). Recursive neural networks can learn logical semantics. arXiv preprint arXiv:1406.1827.

  • Büschken, J., & Allenby, G. M. (2016). Sentence-based text analysis for customer reviews. Marketing Science, 35(6), 953–975.


  • Büschken, J., & Allenby, G. M. (2020). Improving text analysis using sentence conjunctions and punctuation. Marketing Science, 39(4), 727–742.


  • Caldieraro, F., Zhang, J. Z., Cunha, M., & Shulman, J. D. (2018). Strategic information transmission in peer-to-peer lending markets. Journal of Marketing, 82(2), 42–63.


  • Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2), 48–57.


  • Chen, S.F., Beeferman, D., & Rosenfeld, R. (1998). Evaluation metrics for language models. DARPA Broadcast News Transcription and Understanding Workshop.

  • Chiang, W. K., Zhang, D., & Zhou, L. (2006). Predicting and explaining patronage behavior toward web and, traditional stores using neural networks: a comparative analysis with logistic regression. Decision Support Systems, 41(2), 524–531.


  • Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on machine learning, 160–167.

  • Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493–2537.


  • Crain, S. P., Zhou, K., Yang, S., & Zha, H. (2012). Dimensionality reduction and topic modeling: From latent semantic indexing to latent Dirichlet allocation and beyond. In Mining text data (pp. 129–161). Springer.

  • Cui, D., & Curry, D. (2005). Prediction in marketing using the support vector machine. Marketing Science, 24(4), 595–615.


  • Cui, G., Wong, M. L., & Lui, H. (2006). Machine learning for direct marketing response models: Bayesian networks with evolutionary programming. Management Science, 52(4), 597–612.


  • Darani, M., & Shankar, V. (2020). Topic hidden Markov model (THMM): A new machine learning approach to making dynamic purchase prediction. Working paper. Texas A&M University.


  • Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • Dong, G., Liao, G., Liu, H., & Kuang, G. (2018). A review of the autoencoder and its variants: A comparative perspective from target recognition in synthetic-aperture radar images. IEEE Geoscience and Remote Sensing Magazine, 6(3), 44–68.


  • Dotzel, T., & Shankar, V. (2019). The relative effects of business-to-business (vs. business-to-consumer) service innovations on firm value and firm risk: An empirical analysis. Journal of Marketing, 83(5), 133–152.


  • Eliashberg, J., Hui, S. K., & Zhang, Z. J. (2007). From story line to box office: A new approach for green-lighting movie scripts. Management Science, 53(6), 881–893.


  • Ghose, A., Ipeirotis, P. G., & Li, B. (2019). Modeling consumer footprints on search engines: An interplay with social media. Management Science, 65(3), 1363–1385.


  • Goldberg, Y. (2016). Primer on neural network models for natural language processing. Journal of Artificial Intelligence Research. 57, 345–420. arXiv:1807.10854.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. in Advances in Neural Information Processing Systems, 2672–2680.

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.


  • Guerreiro, J., Rita, P., & Trigueiros, D. (2016). A text mining-based review of cause-related marketing literature. Journal of Business Ethics, 139(1), 111–128.


  • Hartmann, J., Huppertz, J., Schamp, C., & Heitmann, M. (2019). Comparing automated text classification methods. International Journal of Research in Marketing, 36(1), 20–38.


  • Heaven, W. D. (2020). OpenAI’s new language generator GPT-3 is shockingly good, and completely mindless. MIT Technology Review.

  • Heitmann, M., Landwehr, J. R., Schreiner, T. F., & van Heerde, H. J. (2020). Leveraging brand equity for effective visual product design. Journal of Marketing Research, 57(2), 257–277.

  • Herhausen, D., Ludwig, S., Grewal, D., Wulf, J., & Schoegel, M. (2019). Detecting, preventing, and mitigating online firestorms in brand communities. Journal of Marketing, 83(3), 1–21.


  • Hermosilla, M., Gutiérrez-Navratil, F., & Prieto-Rodríguez, J. (2018). Can emerging markets tilt global product design? Impacts of Chinese colorism on Hollywood castings. Marketing Science, 37(3), 356–381.

  • Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1–2), 177–196.


  • Hovy, D., Melumad, S., & Inman, J. J. (2021). Wordify: A tool for discovering and differentiating consumer vocabularies. Journal of Consumer Research, 48(3), 394–414.


  • Hu, M., Dang, C., & Chintagunta, P. K. (2019). Search and Learning at a Daily Deals Website. Marketing Science, 38(4), 609–642.

  • Humphreys, A., & Wang, R. J. (2018). Automated text analysis for consumer research. Journal of Consumer Research, 44, 1274–1306.


  • Hutchins, J. (2006). Machine translation: A concise history.

  • Jacobs, B., Donkers, B. J., & Fok, D. (2016). Model-based purchase predictions for large assortments. Marketing Science, 35(3), 389–404.


  • Jalali, N., & Papatla, P. (2019). Composing tweets to increase retweets. International Journal of Research in Marketing, 36(4), 647–668.


  • Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.

  • Kelly, R. (2016). PyEnchant: A spellchecking library for Python.

  • Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

  • Le, D.T., Nguyen, C.T., Ha, Q.T., Phan, X.H,, & Horiguchi, S. (2008). Matching and ranking with hidden topics towards online contextual advertising. In 2008 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, 1 (IEEE), 888–891.

  • Lee, T. Y., & Bradlow, E. T. (2011). Automated marketing research using online customer reviews. Journal of Marketing Research, 48(5), 881–894.


  • Lee, D., Hosanagar, K., & Nair, H. S. (2018). Advertising content and consumer engagement on social media: Evidence from Facebook. Management Science, 64(11), 5105–5131.


  • Lemmens, A., & Croux, C. (2006). Bagging and boosting classification trees to predict churn. Journal of Marketing Research, 43(2), 276–286.


  • Li, J., Monroe, W., Ritter, A., Galley, M., Go, J. & Jurafsky, D. (2016). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

  • Liu, J., & Toubia, O. (2018). A semantic approach for estimating consumer content preferences from online search queries. Marketing Science, 37(6), 930–952.

  • Liu, X., Singh, P. V., & Srinivasan, K. (2016). A structured analysis of unstructured big data by leveraging cloud computing. Marketing Science, 35(3), 363–388.

  • Liu, X., Lee, D., & Srinivasan, K. (2019). Large-scale cross-category analysis of consumer review content on sales conversion leveraging deep learning. Journal of Marketing Research, 56(6), 918–943.

  • Lundberg, S. M., & Lee, S. (2017). A unified approach to interpreting model predictions. Proceedings of the 31st international conference on neural information processing systems.

  • Marr, B. (2019). What is unstructured data and why is it so important to businesses? An easy explanation for anyone.

  • Melumad, S., Inman, J. J., & Pham, M. T. (2019). Selectively emotional: How smartphone use changes user-generated content. Journal of Marketing Research, 56(2), 259–275.


  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

  • Moro, S., Pires, G., Rita, P., & Cortez, P. (2019). A text mining and topic modeling perspective of ethnic marketing research. Journal of Business Research, 103, 275–285.


  • Nam, H., Joshi, Y., & Kannan, P. K. (2017). Harvesting brand information from social tags. Journal of Marketing, 81(4), 88–108.


  • Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012). Mine your own business: Market-structure surveillance through text mining. Marketing Science, 31(3), 521–543.


  • Netzer, O., Lemaire, A., & Herzenstein, M. (2019). When words sweat: Identifying signals of loan default. Journal of Marketing Research, 56(6), 960–980.


  • Ng, A. (2011). Sparse autoencoder. CS294A Lecture Notes, 72, 1-19.

  • Ordenes, F. V., Ludwig, S., De Ruyter, K., Grewal, D., & Wetzels, M. (2017). Unveiling what is written in the stars: Analyzing explicit, implicit and discourse patterns of sentiment in social media. Journal of Consumer Research, 43(6), 875–894.

  • Pan, Y., Huang, P., & Gopal, A. (2019). Storm clouds on the horizon? New entry threats and R&D investments in the US IT industry. Information Systems Research, 30(2), 540–562.


  • Perotte, A.J., Wood, F., Elhadad, N., & Bartlett, N. (2011). Hierarchically supervised latent Dirichlet allocation. in Advances in Neural Information Processing Systems, 2609–2617.

  • Puranam, D., Narayan, V., & Kadiyali, V. (2017). The effect of calorie posting regulation on consumer opinion. Marketing Science, 36(5), 726–746.


  • Reisenbichler, M., & Reutterer, T. (2019). Topic modeling in marketing: Recent advances and research opportunities. Journal of Business Economics, 89(3), 327–356.


  • Ribeiro, A., Matos, L. M., Pereira, P. J., Nunes, E. C., Ferreira, A. L., Cortez, P., & Pilastri, A. (2020). Deep dense and convolutional autoencoders for unsupervised anomaly detection in machine condition sounds. Detection and Classification of Acoustic Scenes and Events (DCASE).


  • Röder, M., Both, A. & Hinneburg, A. (2015). Exploring the space of topic coherence measures. in Proceedings of the Eighth International Conference on Web Search and Data Mining.

  • Rutz, O. J., Sonnier, G. P., & Trusov, M. (2017). A new method to aid copy testing of paid search text advertisements. Journal of Marketing Research, 54(6), 885–900.

  • Srivastava, A., & Sutton, C. (2017). Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488.

  • Taylor, C. (2019). Structured vs. unstructured data. Datamation.

  • Tirunillai, S., & Tellis, G. J. (2014). Mining marketing meaning from online chatter: Strategic brand analysis of big data using latent Dirichlet allocation. Journal of Marketing Research, 51(4), 463–479.


  • Toubia, O., Iyengar, G., Bunnell, R., & Lemaire, A. (2019). Extracting features of entertainment products: A guided latent Dirichlet allocation approach informed by the psychology of media consumption. Journal of Marketing Research, 56(1), 18–36.

  • Vermeer, S. A. M., Araujo, T., Bernritter, S. F., & van Noort, G. (2019). Seeing the wood for the trees: How machine learning can help firms in identifying relevant electronic word-of-mouth in social media. International Journal of Research in Marketing, 36(3), 492–508.


  • Villarroel Ordenes, F., Grewal, D., Ludwig, S., Ruyter, K. D., Mahr, D., & Wetzels, M. (2019). Cutting through content clutter: How speech and image acts drive consumer sharing of social media brand messages. Journal of Consumer Research, 45(5), 988–1012.


  • Vinyals, O., & Le, Q. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.

  • Wang, Y., Huang, M., Zhu, X., & Zhao, L. (2016). Attention-based LSTM for aspect-level sentiment classification. In proceedings of the 2016 conference on empirical methods in natural language processing, 606-615.

  • Weber, N., Shekhar, L., & Balasubramanian, N. (2018). The fine line between linguistic generalization and failure in Seq2Seq-attention models. arXiv preprint arXiv:1805.01445.

  • West, P., Brockett, P. L., & Golden, L. L. (1997). A comparative analysis of neural networks and statistical methods for predicting consumer choice. Marketing Science, 16(4), 370–391.


  • Xiong, G., & Bharadwaj, S. (2014). Prerelease buzz evolution patterns and new product performance. Marketing Science, 33(3), 401–421.


  • Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3), 55–75.


  • Zhong, N., & Schweidel, D. (2020). Capturing changes in social media content: A multiple latent changepoint topic model. Marketing Science, 39(4), 827–846.


  • Zhou, M., Duan, N., Liu, S., & Shum, H. (2020). Progress in neural NLP: Modeling, learning, and reasoning. Engineering, 6, 275–290.



Author information



Corresponding author

Correspondence to Venkatesh Shankar.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dhruv Grewal served as Guest Editor for this article.


Appendix 1

NLP models text preprocessing and representation

NLP involves the following broad processes (e.g., Berger et al., 2020; Netzer et al., 2019).

  • Understanding language syntax: This process includes tasks such as grammar induction, parts-of-speech tagging, parsing, stemming, lemmatization, and morphological segmentation.

  • Understanding language semantics: NLP models impute the meaning of characters, words, and sentences through tasks such as lexical and distributional semantic inference, machine translation, named entity recognition, automated language generation, optical character recognition, sentiment analysis, and topic segmentation and classification, among others.

  • Applied language communication: This process involves automated text summarization, discourse analysis, speech recognition and augmentation, text-to-speech conversion, and dialogue creation.

Text preprocessing

Before an NLP model can be run, the data, which usually take the form of human-readable text, must be converted into well-ordered input suitable for analysis by algorithms. To do so, the texts must be converted to numeric data that adequately represent the text corpus. Text preprocessing provides the tools and techniques that make this conversion efficient. The raw data must be encoded in a consistent format, usually Unicode Transformation Format 8-bit (UTF-8). Data downloaded by web crawlers often contain hypertext markup language (HTML) tags (e.g., from Facebook, Twitter, websites) and may also contain emoticons that carry meaning, so both must be suitably processed (Lee et al., 2018; Ordenes et al., 2017).

Following this step, the data need to be broken down into smaller units (tokens), which can be characters, words, sentences, or paragraphs depending on the type of modeling desired. This is typically done by defining a delimiter that serves as a reference, a process referred to as tokenization. In some cases, spelling checks and corrections may also be necessary (Kelly, 2016). Assigning a grammatical reference or other rule-based definition to each token aids semantic interpretation; this is accomplished through “parts-of-speech” tagging, which attributes grammatical context to words (Pan et al., 2019). Many words do not impart significant meaning to the message but merely support linguistic flow and language appropriateness. Because NLP models are designed to perform highly specific tasks, retaining all such words only adds computational burden. Therefore, these peripheral words (known as stop words) are eliminated, drawing on popular lists compiled in prior research. Finally, stemming and/or lemmatization can be applied to reduce each word to its root form.
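The preprocessing pipeline above can be sketched with the Python standard library alone. The stop-word list and the suffix-stripping stemmer here are toy stand-ins for the curated lists and Porter-style stemmers shipped with toolkits such as NLTK:

```python
import re

# A hypothetical stop-word list; real pipelines use curated lists
# such as those distributed with NLTK or spaCy.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def tokenize(text):
    """Lowercase the text and split on non-alphanumeric delimiters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop peripheral words that add no task-specific meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Toy suffix-stripping stemmer standing in for Porter stemming."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]
```

For example, `preprocess("The announced products are shipping to stores")` yields a cleaned token list ready for vectorization.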

Text representation

This process converts the preprocessed text into a numeric representation that can be used for NLP model training and evaluation. Two types of representation exist: discrete and distributed. One-hot encoding, Bag-of-Words (BoW), and Term Frequency-Inverse Document Frequency (TF-IDF) are three common forms of discrete representation.

One-hot encoding is the simplest way to convert a corpus of texts into vectors: each word is represented as a vector whose dimension equals the size of the vocabulary, with a value of one in the column corresponding to the word’s index in the vocabulary and zero elsewhere. However, it is a computationally intensive and memory-inefficient process.
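A minimal sketch of one-hot encoding (function name ours, for illustration), which makes the memory inefficiency visible: every token carries a vocabulary-sized vector that is zero almost everywhere:

```python
def one_hot_encode(tokens):
    """Map each token to a vocabulary-sized indicator vector."""
    vocab = sorted(set(tokens))                 # fix an index for each unique term
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in tokens:
        v = [0] * len(vocab)                    # all zeros ...
        v[index[t]] = 1                         # ... except the word's own column
        vectors.append(v)
    return vocab, vectors
```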

Bag of Words (BoW) is another approach that uses the frequency count of each word in the corpus to vectorize the input texts, with one column entry for each unique term in the cleaned corpus. This approach also has limitations: high dimensionality of the input, loss of semantic relationships among the words in the corpus, and the dominance of common words as high-frequency terms.
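A sketch of BoW vectorization over a toy corpus (assuming whitespace-tokenized documents), showing how each document becomes a row of per-term counts:

```python
from collections import Counter

def bow_vectorize(documents):
    """Represent each document by its per-term frequency counts."""
    vocab = sorted({t for doc in documents for t in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix
```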

Another approach, TF-IDF, addresses this issue by combining the term frequency of a word in a document with the proportion of documents that contain the word. The TF-IDF score w can be expressed as follows:

$$ w_{ij} = tf_{ij} \times \log\left(\frac{N}{df_i}\right) $$

where word i in document j has term frequency tfij, N is the total number of documents in the corpus, and dfi is the count of documents containing word i. This approach also suffers from high dimensionality, but techniques like hashing can be used on large-scale datasets to overcome this problem.
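The TF-IDF formula above can be implemented directly over tokenized documents; this sketch computes wij = tfij × log(N/dfi) term by term:

```python
import math

def tf_idf(documents):
    """Score w_ij = tf_ij * log(N / df_i) for a list of tokenized documents."""
    n_docs = len(documents)
    vocab = sorted({t for doc in documents for t in doc})
    # df_i: number of documents containing word i
    df = {term: sum(term in doc for doc in documents) for term in vocab}
    matrix = []
    for doc in documents:
        row = []
        for term in vocab:
            tf = doc.count(term)              # tf_ij: term frequency in this document
            row.append(tf * math.log(n_docs / df[term]))
        matrix.append(row)
    return vocab, matrix
```

A word that appears in every document gets a weight of zero (log of one), which is how TF-IDF discounts common terms.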

However, distributed representation is a more efficient way to represent text that uses a word’s meaning in generating the vectors. Word2Vec, initially proposed by Mikolov et al. (2013) and built on the Continuous Bag of Words (CBOW) and skip-gram models, and Global Vectors (GloVe) produce context-independent word embeddings, while language models such as ELMo and Bidirectional Encoder Representations from Transformers (BERT) produce embeddings that depend on the context in which they are trained; fastText extends Word2Vec with subword information.
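Both CBOW and skip-gram models are trained on (target, context) pairs drawn from a sliding window over the text. A sketch of the pair-generation step (the function name and window size are illustrative, not from the paper):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs from a sliding window,
    as used to train Word2Vec-style skip-gram embeddings."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                      # left edge of the window
        hi = min(len(tokens), i + window + 1)        # right edge of the window
        for j in range(lo, hi):
            if j != i:                               # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs
```

Skip-gram predicts each context word from the target; CBOW inverts the same pairs, predicting the target from its context.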

Appendix 2

Tutorial on autoencoder training

Applying different autoencoders to extract topics requires a collection of text documents. The following are the broad steps to apply autoencoder models in any given context.

  1. The first step is text pre-processing. Standard processes to clean texts include, but are not limited to, stop-word removal, character encoding and filtering, stemming, lemmatization, and tokenization. The steps needed depend on the corpus at hand and the subsequent text vectorization method selected. Natural language toolkits such as NLTK and spaCy are popular packages with extensive text pre-processing capabilities. For instance, unlike texts sourced from Wikipedia articles, web-scraped text datasets may need an additional step to remove HTML tags. If the vectorization method is bag of words (BoW) or TF-IDF (whose vectors do not represent word semantics), it is important to have a clean collection of tokens after pre-processing so that a more thorough investigation can be carried out. However, models like Word2Vec and GloVe, with their capability to learn and infer word semantics, can alleviate this problem.
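The HTML tag removal mentioned in step 1 can be sketched with a regular expression. This is a toy illustration; production pipelines should use a real HTML parser rather than regexes:

```python
import re

def strip_html(raw):
    """Remove HTML tags and character entities from web-scraped text.
    Regex sketch for illustration; prefer a proper HTML parser in practice."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)               # drop <...> tags
    no_entities = re.sub(r"&[a-zA-Z#0-9]+;", " ", no_tags)  # drop &nbsp; etc.
    return re.sub(r"\s+", " ", no_entities).strip()      # collapse whitespace
```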

  2. Once the corpus is cleaned, the next step is to vectorize the entire corpus. For this step, all available documents are used so that the autoencoder model is trained on a comprehensive vocabulary. A vectorization method should be used to obtain a feature representation of all the tokens in the cleaned corpus. Alternatively, pre-trained embeddings such as Word2Vec, GloVe, or more recent models like BERT can be used. Moreover, an embedding layer from TensorFlow 2.0 (also available in PyTorch) can be used to train embeddings simultaneously with the autoencoder. Thus, a document is converted into a collection of tokens, each represented as an n-dimensional vector.
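When a trainable embedding layer is used as in step 2, documents are first converted to fixed-length sequences of vocabulary indices. A stdlib-only sketch of that conversion (the reserved `<pad>` token and function name are our own conventions):

```python
def build_index_sequences(docs, max_len):
    """Convert tokenized documents into fixed-length index sequences,
    the usual input format for a trainable embedding layer."""
    vocab = {"<pad>": 0}                     # reserve index 0 for padding
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    sequences = []
    for doc in docs:
        seq = [vocab[tok] for tok in doc[:max_len]]  # truncate long documents
        seq += [0] * (max_len - len(seq))            # right-pad short ones
        sequences.append(seq)
    return vocab, sequences
```

An embedding layer (e.g., in TensorFlow or PyTorch) then maps each index to an n-dimensional vector that is learned during autoencoder training.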

  3. The third step is to use the vectorized corpus for autoencoder training. TensorFlow, PyTorch, or other packages can be used to construct the different autoencoder models with the appropriate architectures. The constructed model should be paired with a custom loss function, a weighted combination of KL-divergence and reconstruction loss; the weights should be held constant across all autoencoder models. The entire vectorized corpus is fed to the model, which is trained to reconstruct the same documents; through training, the hidden representations in the bottleneck layer learn different topic associations. The loss plots for all autoencoder models should be monitored to check whether the loss stabilizes over training epochs, and a fixed random seed can be used to ensure reproducibility. Typically, not all models perform equally well on a given dataset, and in some cases it may be hard to get every model to converge over the epochs. Hence, we recommend running all the autoencoder models on a given dataset. Generally speaking, the self-attention, convolutional, and bidirectional LSTM models perform best, in that order.
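The weighted loss described in step 3 can be sketched in plain Python. The argument names (`topic_dist`, `prior`) and the 0.1 weight are illustrative assumptions, not values from the paper; what matters is that the same weight is held fixed across all autoencoder models being compared:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reconstruction_loss(x, x_hat):
    """Mean squared error between a document vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def autoencoder_loss(x, x_hat, topic_dist, prior, kl_weight=0.1):
    """Weighted combination of reconstruction loss and KL divergence,
    as described in step 3; kl_weight is an illustrative choice."""
    return reconstruction_loss(x, x_hat) + kl_weight * kl_divergence(topic_dist, prior)
```

In a real implementation this function would be passed to the training framework (e.g., as a custom loss in TensorFlow or PyTorch) and minimized over epochs.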

  4. The final step is to determine the predictive performance of the autoencoder topic models using the perplexity, coherence, and FREX (frequency-exclusivity) scores. Lower perplexity indicates a better-performing model, while higher coherence indicates a better-performing model. Coherence scores have different formulations, and online tools like Palmetto can be used to evaluate the scores for all the autoencoders. Each score captures performance on a particular criterion, and together they can be used to select the best-performing autoencoder for the given dataset. In addition, domain-specific judgment should be applied when assessing model performance, as each score is inherently limited in its capacity; different autoencoders may score differently on different datasets.
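As one concrete coherence formulation from step 4, the UMass coherence score sums log((D(wi, wj) + 1) / D(wj)) over ordered pairs of a topic's top words, where D counts document (co-)occurrences. A small sketch under the assumption that documents are token lists:

```python
import math

def umass_coherence(topic_words, documents):
    """UMass topic coherence over a topic's top words.
    Higher (less negative) values indicate a more coherent topic."""
    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in documents)

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((doc_count(wi, wj) + 1) / doc_count(wj))
    return score
```

A topic whose words co-occur often in the corpus scores higher than one whose words rarely appear together, which is the intuition behind using coherence for model selection.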

Appendix 3

Table 13 Examples of new product announcements in the new product announcements dataset
Table 14 Examples of news articles in the news articles dataset

Appendix 4

Illustrating differences in topic learning among models

We address the question of what qualitative insights on topic learning we can gain from the autoencoder models that perform better than the other NLP models.

AEs offer more versatile, language context-specific learning than the traditional models: the bottleneck (topic) layer of an AE assesses activations across all word vector dimensions (in our case, 100-dimensional word vectors) to attribute topic association. In contrast, statistical models like LDA use the overall term/word frequency and the estimated term/word frequency within the selected topic for topic assignment and do not consider language semantics or polysemy; thus, the assignment in such models is not as exhaustive as it is for AEs. The trained AE models yield similar activations among the neurons of the topic layer if the document contains the corresponding topics. The greater the difference between activations across the word vector dimensions, the more confident the AE is in assessing the presence or absence of a topic. To illustrate, we can visualize the activations of the trained AE models for a given announcement by plotting the magnitudes of activations across the neurons of the topic layer.

Consider the announcement “Dell (Nasdaq:DELL), America’s leading computer systems company for small businesses, today added two new products to its line of easy-to-use PowerConnect(tm) network switches for U.S. customers.”

With the LDA model, we get the topic assignment of one cluster (the business aspects/implications of announcement) based on word frequency count. With AEs, we can use the activation plots shown in Fig. 3.

Fig. 3

Activation maps for all AEs for the sample announcement

The three lines in each map correspond to the three neurons of the topic layer, plotted across all the dimensions on which the words are vectorized (100 in this case). For every AE model, one topic neuron shows greater activation than the other two, indicating that topic’s presence in the announcement. However, the figures also show that not all dimensions are equally activated for the LSTM and BiLSTM models; these differences suggest the presence of a different topic cluster (the product description and features cluster). A similar inference can be made for each announcement. Thus, unlike LDA, where word frequencies are the predominant determinant of cluster assignment, the activation values across many dimensions decide the assignment for AEs. This investigation can be extended to enhance the explainability of AE models, consistent with Lundberg and Lee (2017).


About this article


Cite this article

Shankar, V., Parsana, S. An overview and empirical comparison of natural language processing (NLP) models and an introduction to and empirical application of autoencoder models in marketing. J. of the Acad. Mark. Sci. 50, 1324–1350 (2022).
