1 Introduction

The pivotal role of food in shaping human life, health, and well-being is well-acknowledged and extensively studied (Achananuparp et al. 2018; Ludwig et al. 2018; Nordström et al. 2013). A fundamental aspect of understanding the impact of food on health outcomes involves the analysis and understanding of the composition of our diets (Menichetti et al. 2023).

In recent years, the increasing availability of significant amounts of food-related data, including food images, recipes, and taste preferences from diverse sources such as social networks and recipe-sharing websites, has ignited interest in modern computational methods for food analysis. Given the sheer volume and variety of food data, the application of machine learning, data mining, and data analytics techniques to the realm of food computing is emerging as a bridge connecting the intricate world of food with services tailored to human needs (Min et al. 2019). Significant research efforts have been invested in harnessing the wealth of online food data to create food recommendation systems that cater to a wide range of audiences (Phanich et al. 2010; Trattner and Elsweiler 2017; Ege and Yanai 2017; Banerjee and Nigar 2019; Iwendi et al. 2020). A critical facet of this field involves the development of health-conscious food intelligence systems. Achieving this task necessitates profound domain expertise and a comprehensive understanding of recipes that goes beyond mere data extraction. It involves grappling with the intricacies of ingredients, their quantities, and processing methods concerning diverse dietary needs. There is a pressing demand for precise entity recognition and extraction to facilitate the accurate interpretation of recipes and empower intelligent systems to provide health-oriented recommendations and dietary guidance. Correspondingly, several research projects set out to address this need (Trattner and Elsweiler 2017; Freyne and Berkovsky 2010; Teng et al. 2012; Elsweiler et al. 2017).

However, the translation of these research efforts into practical applications, particularly those reliant on concrete entity recognition to facilitate intelligent food analysis and recommendation systems, faces a significant impediment in the form of the current state of available datasets. While commendable progress has been made in curating annotated datasets for biomedical entities and broader Natural Language Processing (NLP) tasks (Lample et al. 2016; Huang et al. 2015), there is a substantial dearth of equivalent resources in the food domain. This scarcity poses a significant hindrance to the development of robust models capable of handling the intricate information retrieval tasks integral to the burgeoning field of food computing.

Present large volume food benchmark datasets, such as Food-101 (Bossard et al. 2014), while valuable, primarily concentrate on image recognition or recipe and image classification, with limited attention to recipe text-based information retrieval. Text-based datasets, such as the Carnegie Mellon University Recipe Database (CURD) corpus (la Torre Frade et al. 2008) and the recipe flow graph (r-FG) corpus (Mori et al. 2014), have limitations primarily related to their scale and scope. Recent initiatives, like FoodBase (Popovski et al. 2019) and TASTEset (Wróblewska et al. 2022), have made strides in named entity recognition (NER) within food datasets. Yet, these resources remain constrained in terms of scale and comprehensiveness. The absence of large-scale, high-quality text corpora exclusively targeting the food domain presents a substantial challenge in the development and enhancement of food-specific NLP models.

Hence, there is an imperative need for substantial investment in the creation/collection of robust, annotated food-centric text corpora. These resources could serve as a driving force for advancements in food computing, effectively bridging the current gap between existing capabilities and the potential for sophisticated, health-conscious food recommendation systems.

Our research seeks to make significant contributions to resource development and the creation of robust models, both critical for the functionality of intelligent food recommendation and analysis systems. In this article, we concentrate on the Ingredient NER problem. Specifically, we define and address the Ingredient NER problem as follows:

Definition 1

Ingredient NER: Given sets \(\mathcal{S}_l = \{\mathcal {S}^{[i]}, \mathcal {T}^{[i]}\}^{N}_{i=1}\) and \(\mathcal {S}_{u} = \{\mathcal {S}^{[i]}\}^{N}_{i=1}\) representing the tagged and untagged sets of ingredient list entries, where \(\mathcal {S}^{[i]} = \{w^{[i,j]}\}_{j=0}^{M}\) and \(\mathcal {T}^{[i]} = \{t^{[i,j]}\}_{j=0}^{M}\). The variable \(w^{[i,j]}\) represents sequentially ordered words that are semantically correct but may exhibit different syntax configurations with respect to recipe ingredient expressions, and \(t^{[i,j]}\) is the corresponding word tag. Given a new ingredient expression \(\hat{\mathcal {S}}^{[i]} = \{\hat{w}^{[i,j]}\}_{j=0}^{M}\), the objective is to infer the appropriate tag for each word \(\hat{w}^{[i,j]}\) in the sequence by leveraging knowledge from the labeled and (in the semi-supervised approach) unlabeled datasets.

We wish to emphasize the following key points concerning the defined problem:

  • Due to the absence of a universal syntax for constructing recipe ingredient list entries, ingredient expressions \(\mathcal {S}\), even from the same sources, can exhibit different syntax structures, although they convey semantically identical information. For example, “1 sheet puff pastry frozen (thawed)” and “puff pastry 1 sheet frozen (thawed)” both convey the same semantic meaning but possess distinct syntax. As discussed in Sect. 3, these syntactic variations pose challenges for several state-of-the-art (SOTA) ingredient tagging models. Consequently, there is a pressing need for a robust model capable of comprehending and effectively handling diverse input syntaxes.

  • Current NER models rely on extensive and high-quality datasets for training, a requirement exacerbated by the diverse input syntax coupled with the wide variety of ingredients encountered. While numerous ingredient expression datasets are available (e.g., RecipeNLG), manual annotation costs render them insufficient for NER model training.

To advance the field of ingredient NER and address the aforementioned challenges, we aim to develop a robust ingredient NER model with the following advanced features:

  • Syntax Agnosticism: Our model is designed to be syntax-agnostic, allowing it to recognize and correctly tag entities within ingredient expressions. Specifically, given two ingredient expressions, \(\mathcal {S}^{[i]} = \{w^{[i,1]}, w^{[i,2]}, \ldots , w^{[i,M]}\}\) and \(\mathcal {S}^{[a]} = \{w^{[a,1]}, w^{[a,2]}, \ldots , w^{[a,M]}\}\), which are semantically equivalent, the resulting tags \(\mathcal {T}^{[i]} = \{t^{[i,1]}, t^{[i,2]}, \ldots , t^{[i,M]}\}\) and \(\mathcal {T}^{[a]} = \{t^{[a,1]}, t^{[a,2]}, \ldots , t^{[a,M]}\}\) should be identical for corresponding words in the phrases.

  • Semi-Supervised Learning (SSL): With a limited tagged dataset and a vast amount of untagged ingredient expression datasets, our model is designed to efficiently learn from both labeled and unlabeled datasets. It harnesses the information in unlabeled data to enhance its inference capabilities effectively.

In this paper, we introduce SINERA, an efficient and robust NER model, and its semi-supervised counterpart, SINERAS. SINERAS effectively captures the intrinsic data structure by learning from labeled and unlabeled data. Our extensive experimental evaluation demonstrates the superior performance of our proposed model, surpassing existing SOTA methods, all while reducing the volume of trainable weights and computational resources compared to recent Large Language Models (LLMs). We adopt a multi-view data input approach, which expands the model’s interpretative capabilities. We harness the potential of SSL and data augmentation techniques to boost the models’ performance.

In addition to the model, we introduce the ARTI dataset, a novel large-scale resource for the Ingredient NER domain. This dataset includes over 70,000 annotated recipe ingredient expressions, categorized into seven distinct classes. It also contains over 1 million unannotated ingredient expressions for unsupervised/semi-supervised use. This diverse dataset significantly augments existing resources, providing a substantial foundation for advanced research, innovative methodologies, and sophisticated applications in the field of food computing.

The remaining sections of this paper include a discussion of related work in Sect. 2, an exploration of the spurious correlation problem in Sect. 3, an introduction to the proposed SINERA model in Sect. 4, a description of the training objectives in Sect. 5, an overview of the dataset and the results of our experimental evaluations in Sect. 6, and finally, a conclusion and discussion of future work in Sect. 7. In this manuscript, we utilize the terms “category” and “tag” interchangeably to denote the classifications of words within ingredient lines.

2 Related works

2.1 Named entity recognition

In natural language processing, NER has long been studied as a way to automate the extraction and classification of entities from text. Traditional approaches have relied on Conditional Random Fields (CRFs), a statistical modeling technique proposed by Lafferty et al. (2001). Following these early methodologies, deep learning had a transformative impact, spearheaded by the Long Short-Term Memory (LSTM) cell (Hochreiter and Schmidhuber 1997) as an essential building block of recurrent neural networks. Capitalizing on this development, Huang et al. (2015) combined a CRF layer with LSTM cells, improving sequence tagging accuracy.

Recently, attention-based models have been steadily gaining ground in the NER area due to their capacity to support massively parallel training and to resolve concerns with long-range dependencies in recurrent architectures. Prominently, the utilization of BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018)—a paradigmatic shift based on the Transformer framework (Vaswani et al. 2017)—has proven particularly effective. This finding has been replicated in numerous studies (Devlin et al. 2018; Wu and He 2019; Xu et al. 2019), with notable performance in the specialized domain of biomedical NER (bioNER) (Lee et al. 2020; Alsentzer et al. 2019; Beltagy et al. 2019).

In the specific domain of food computing, a series of research works have applied SOTA models on unique food-centric corpora. For instance, explorations leveraging the FoodBase dataset (Popovski et al. 2019) have utilized models like Bidirectional LSTM (BiLSTM) (Cenikj et al. 2020) and fine-tuned BERT and its variants (Perera et al. 2022; Stojanov et al. 2021; Cenikj et al. 2021). Further advances in the field have been made following the presentation of the TASTEset dataset (Wróblewska et al. 2022), which introduced a novel benchmark through the strategic fine-tuning of BERT integrated with a CRF layer.

Despite remarkable progress in previous works, these efforts have primarily been confined to small-scale datasets. A crucial task in our research is to explore the potential of NER models to generalize across larger-scale food datasets effectively. Besides, the considerable computational cost of fine-tuning large language models like BERT cannot be ignored. To address these challenges, our work charts a new direction by proposing a relatively smaller parameter size model. Notwithstanding its computational efficiency, this streamlined model rivals BERT in accuracy, thereby emerging as a promising solution in the landscape of ingredient NER models.

2.2 Semi-supervised learning on text classification

The rise of SSL as a potent methodology stems from the challenges of data annotation, namely its intensive time and cost and the specialist knowledge it demands, combined with the overwhelming abundance of accessible yet unlabeled data. SSL has been extensively leveraged within the text classification domain, showcasing its adaptability and pronounced value in augmenting model performance.

Self-training, a foundational approach in SSL, harnesses the model’s inference capability, providing an expanded training set, consequently amplifying the model’s learning and generalization abilities. It iteratively generates predictions on unlabeled data and effectively creates a feedback loop for continuous improvement. Notable techniques have exploited this concept with remarkable success, such as pseudo-labeling (Lee 2013), which crafts class labels from the model’s predictions. An extension of this approach is demonstrated in Unsupervised Data Augmentation (UDA) (Xie et al. 2020), which encourages output consistency between an unlabeled example and its augmented counterpart. Likewise, FixMatch (Sohn et al. 2020) incorporates consistency loss to instances with pseudo-labels, particularly focusing on predictions that exude a high confidence level.

Furthermore, self-training strategies have also been efficiently extended into unsupervised representation learning. Deep Embedded Clustering (DEC) (Xie et al. 2016) introduces an iterative refinement approach for cluster assignments, particularly within computer vision. In this method, an auxiliary target distribution is generated from the existing soft-assignment predictions. This auxiliary distribution serves to direct the model in refining its cluster allocations in subsequent iterations. Building upon this foundation, the confidence-based Kullback–Leibler (KL) Regularizer (Stein 2020) extended the principles of DEC to the NER domain. By replacing the auxiliary target distribution with direct model predictions, experiments showcased notable performance improvements, surpassing the established BERT-CRF baseline performance metrics on a sizable Kaggle dataset (Jaswani 2020) comprising 48,000 sentences.

In our proposed methodology, we assume the data is generated by a mixture of \(\mathcal {C}\) Gaussian components. Specifically, we frame the task as a Bayesian Gaussian mixture model (BGMM) problem. We estimate these assumed components on the unlabeled dataset via a variational inference strategy. We then encourage words assigned to the same mixture component to have similar embeddings while pushing apart the embeddings of words from different components. By learning to estimate the mixture components generating the data, the model learns to capture the data’s inherent structure. This auxiliary information is then directed toward model regularization for entity tagging.

3 Spurious correlation problem

This work elucidates previously unrecognized spurious correlations associated with entity positions within the food NER domain. Our findings reveal that such correlations are widespread among numerous baseline models, hindering their ability to discern the underlying linguistic patterns. In this section, we delineate the spurious correlation issue and introduce a rule-based recipe manipulation technique designed to mitigate the problem.

3.1 Problem overview

There is an expanding corpus of evidence pointing towards neural networks’ propensity to harness spurious correlations in practical contexts (Xiao et al. 2020; Niven and Kao 2019; Hewitt and Manning 2019). This issue becomes especially salient in high-risk application use cases, where inaccurate outcomes may have a severe impact. In these scenarios, models have been observed to prioritize superficial characteristics over intrinsic feature constructs. Within NLP, even models that excel in benchmark evaluations might not necessarily demonstrate profound linguistic acumen; instead, they might draw from and operate on inadvertent patterns evident in skewed datasets (Gururangan et al. 2018; Kaushik and Lipton 2018). For instance, in sentiment analysis, an underlying bias may be observed where elongated reviews predominantly receive positive labels, leading models to consider length as a determinant of sentiment, thus sidelining a thorough content analysis (Zhang and Wallace 2015; Angelidis and Lapata 2018). A similar effect also manifests in NER tasks, where models, swayed by data tendencies, may prematurely conclude that capitalization alone signifies the presence of a proper noun such as the name of a person or organization (Devlin et al. 2018; Akbik et al. 2018a).

A spectrum of methodologies for textual data enhancement has been introduced, with a notable portion tailored for NER tasks that can mitigate the challenges posed by spurious correlations. One key strategy is the substitution method. This strategy includes techniques like synthetic data generation, where new data points are crafted by adjusting existing ones, and entity replacement, which swaps out specific terms in a text with others with similar characteristics. These methods destabilize non-representative patterns, steering models towards a more authentic understanding and away from mere surface-level interpretations. However, harnessing these substitution techniques necessitates an intricate comprehension of the task domain, including extensive familiarity with feature embedding. Moreover, sampling-based approaches offer a divergent perspective, such as Data Re-sampling (Chawla et al. 2002) and Bootstrapping (Riloff and Jones 1999). By carefully selecting the data samples, these methods remove data instances that could misguide models, ensuring a wholesome learning experience from untainted data. Meanwhile, randomization-centric techniques, such as Back-translation (Sennrich et al. 2015)—where texts undergo a bilingual transformation process—and contextual perturbation (Wei and Zou 2019) that introduces intentional variations to texts, bring diversity into the mix without demanding extensive data insights. Such techniques inherently bolster model adaptability, aiding in circumventing unwarranted spurious patterns.

3.2 Spurious correlation to ingredient entity position

Considerable research efforts have been directed toward mitigating spurious correlations in the field of NER. However, to the best of our knowledge, there are no prior studies specifically examining the issue within the context of ingredient NER. The challenges encountered in this particular domain are manifold, stemming from the limited size of available datasets, the inherent diversity of ingredients, and the varying patterns of ingredient expression in recipes. Consequently, unconscious spurious correlations may inadvertently be introduced, leading models to prioritize superficial shortcuts over the underlying linguistic patterns inherent in sentences. Moreover, the skewed test sets fail to adequately evaluate a model’s capacity to discern linguistic patterns and entities; instead, such test sets may yield inflated performance metrics by capitalizing on learned shortcuts that lack generalizability in real-world data.

To exhibit the pervasive issue of spurious correlations within the food NER domain and to take an initial step toward bridging this gap, we conducted an analysis of the TASTEset and ARTI datasets (see Sect. 6), with a particular focus on the distribution of entities in relation to their encoded positions within sentences. As depicted in Fig. 1, we present the position distribution of each category across the entire dataset. Notably, QUANTITY and UNIT entities strongly correlate with position, predominantly appearing in the first and second places of a recipe line. In addition, several categories display a normal distribution with a single peak, indicating that their occurrences are concentrated within a specific range. It is important to note that such correlations are not necessarily reflected in real-world ingredient expressions. For instance, “3.0 thick slice crusty bread” could also be phrased as “crusty bread 3.0 thick slice” or “crusty bread thick 3.0 slice”, which are equally common in real-world contexts. In these examples, QUANTITY and UNIT are not necessarily the first and second entities in a recipe. Such undesirable patterns in the dataset invite models to exploit shortcuts to enhance performance. To quantitatively assess the detrimental effects of these patterns on model learning, we compared the performance of models trained on the original dataset with those trained on a manipulated dataset that more closely resembles a real-world distribution, as detailed in Sect. 6.4.

Fig. 1 Positional distribution of categories within the TASTEset and ARTI datasets. (Color figure online)
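The positional statistics visualized in Fig. 1 can be reproduced with a few lines of code. The sketch below is a minimal illustration, assuming a hypothetical in-memory representation of a tagged dataset as (words, tags) pairs; it is not the authors' analysis script.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory form of a tagged dataset: one (words, tags) pair per
# ingredient line, e.g. (["1", "sheet", "puff", "pastry"],
#                        ["QUANTITY", "UNIT", "FOOD", "FOOD"]).
def positional_distribution(tagged_lines):
    """For every tag category, count how often it occurs at each word position
    and normalize the counts into a per-category distribution."""
    counts = defaultdict(Counter)
    for words, tags in tagged_lines:
        for position, tag in enumerate(tags):
            counts[tag][position] += 1
    return {
        tag: {pos: n / sum(positions.values()) for pos, n in sorted(positions.items())}
        for tag, positions in counts.items()
    }
```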

3.3 Synthetic recipe manipulation

In order to address the issue of spurious correlation within the food NER domain, we introduce Synthetic Recipe Manipulation (SRM), a rule-based recipe manipulation technique. SRM aims to augment the syntactic diversity of recipes while maintaining the occurrence of entities and preserving the original meaning of the recipe. Notably, SRM achieves this without relying on contextual information or introducing additional entities. Table 1 shows the proposed SRM data manipulation rules, developed based on the ARTI dataset; the rules extend to other datasets with different tags (e.g., TASTEset) by substitution:

Table 1 Proposed rules for data augmentation. (Color table online)

The presented guidelines provide a systematic methodology for the rearrangement of ingredient expressions, with the objective of maintaining consistency while introducing diversity within the dataset. The transformations include switching the order of quantity, unit, and ingredient names, rearranging the placement of state descriptors, and exchanging segments when separated by a comma. Adhering to these guidelines guarantees the generation of a varied dataset suitable for data augmentation when applicable. The SRM rules are applied to individual ingredient expressions with a predetermined probability. As a result, the SRM method yields a synthetic augmentation of the original dataset, increasing the total number of recipe instances. To extend the applicability of SRM to other food NER datasets, such as the TASTEset, we employ a categorization approach to ensure generalizability across datasets with differing categories. Specifically, we implemented category equalizations within certain contexts in the TASTEset data. We consider the following substitutions to be equivalent:

  • QUALITY (PHYSICAL QUALITY) can replace STATE or TEMP.

  • PROCESS can replace STATE.

  • COLOR, which serves as a food descriptor, is always included before FOOD.

  • PURPOSE, which is often a comment to describe the intent of a particular ingredient/action, can optionally be pushed to the end.
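To make the manipulation concrete, the following is a minimal sketch of an SRM-style rule applied with a predetermined probability. It illustrates only one of the transformations described above (moving a leading QUANTITY/UNIT span behind the rest of the line); the full rule set is given in Table 1, and the helper name srm_augment is illustrative rather than the authors' implementation.

```python
import random

def srm_augment(words, tags, prob=0.5, rng=random):
    """Illustrative SRM-style rule: with probability `prob`, move a leading
    QUANTITY/UNIT span behind the rest of the line, e.g.
    "1 sheet puff pastry" -> "puff pastry 1 sheet".
    Tags move with their words, so no entity is added, removed, or relabeled."""
    if rng.random() > prob:
        return list(words), list(tags)
    k = 0
    while k < len(tags) and tags[k] in ("QUANTITY", "UNIT"):
        k += 1
    if 0 < k < len(tags):
        words = list(words[k:]) + list(words[:k])
        tags = list(tags[k:]) + list(tags[:k])
    return list(words), list(tags)
```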

The effectiveness of SRM in mitigating spurious correlations is evident from the comparison between the original and augmented positional distribution of entities in the TASTEset and ARTI datasets, as depicted in Fig. 2. In the manipulated datasets, SRM purposefully constructs a long-tail distribution for the QUANTITY and UNIT entities, extending their emergence to other positions and thus breaking the strong correlation between entity category and position. This contrasts with the original datasets, where these entities primarily occupied the initial positions in recipe lines. Additionally, SRM introduces substantial heterogeneity in the positional distribution of other categories, enabling unorthodox positions that deter models from exploiting shortcuts based on positional patterns.

Fig. 2 Positional distribution of categories within the augmented TASTEset and ARTI datasets

4 Proposed model

4.1 Notation

  • \(\mathcal {S} = \{\{\mathcal {S}^{[1]}, \mathcal {T}^{[1]}\}, \{\mathcal {S}^{[2]}, \mathcal {T}^{[2]} \}, \dots , \{\mathcal {S}^{[N]}, \mathcal {T}^{[N]}\}\}\) is a set of recipe ingredient list entries, where each ingredient expression is composed of a sequence of words \(\mathcal {S}^{[i]} = \{w^{[i,1]}, w^{[i,2]}, \cdots , w^{[i,M]}\}\) and corresponding tag set \(\mathcal {T}^{[i]} = \{t^{[i,1]}, t^{[i,2]}, \cdots , t^{[i,M]}\}\). For the unlabeled set \(\mathcal {S}_u\), the tag set \(\mathcal {T}\) is an empty set \(\emptyset\).

  • \(t^{[i,j]} \in \mathcal {T}^{[i]}\) is the tag associated with each word \(w^{[i,j]}; \quad j \leftarrow 1 \dots M\) in the word sequence \(\mathcal {S}^{[i]}\)

  • \(\mathcal {X}_{w} = \{x^{[i,j]}_{w}\}; \quad j \leftarrow 1 \dots M\) is a set of word-level embedding of each word in a given sequence \(\mathcal {S}^{[i]}\)

  • \(\mathcal {X}_{r} = \{x^{[i,j]}_{r}\}; \quad j \leftarrow 1 \dots M\) is a set of character-level embedding of each word in a given sequence \(\mathcal {S}^{[i]}\)

  • \(\mathcal {X}_{b} = \{x^{[i,j]}_{b}\}; \quad j \leftarrow 1 \dots M\) is a set of POS-tag embeddings given the POS tags \(\mathcal {B}_i = \{b^{[i,1]}, b^{[i,2]}, \cdots , b^{[i,M]}\}\) associated with each word in a given sequence \(\mathcal {S}^{[i]}\)

  • \(\mathcal {X}_{g} = \{x^{[i,j]}_{g}\}; \quad j \leftarrow 1 \ldots M\) is the aggregation of the multi-view embedding of each word in a given sequence \(\mathcal {S}^{[i]}\)

  • \(h^{[i,j]} \in \mathcal {H}^{[i]}\) is the learned contextual embedding of each word \(w^{[i,j]}\)

  • \(f(.; \theta )\) is a neural network-based function parameterized by a set of parameters \(\theta\).

  • N, M, and K represent specific parameters: N signifies the total count of ingredient expressions, M indicates the maximum number of words in an expression, and K represents the number of tag classes (Algorithm 1).

    Algorithm 1 Predict the set of tags associated with each word in a given ingredient expression, \(P(\hat{\mathcal {T}}^{[i]} | \mathcal {S}^{[i]})\)

4.2 Architecture overview

Our proposed model embraces a multifaceted approach to represent input samples, allowing us to harness the richness of diverse data views. Figure 3 provides an illustration of our model’s architecture. SINERA is designed to learn from both labeled data, represented as \(\mathcal {S}_l=\{\{\mathcal {S}^{[1]}, \mathcal {T}^{[1]}\}, \{\mathcal {S}^{[2]}, \mathcal {T}^{[2]}\}, \cdots , \{\mathcal {S}^{[N]}, \mathcal {T}^{[N]}\}\}\), and unlabeled data, presented as \(\mathcal {S}_u =\{\mathcal {S}^{[1]}, \mathcal {S}^{[2]}, \cdots , \mathcal {S}^{[\mathcal {N}]}\}\), in the context of ingredient lists. Upon completing the end-to-end training process, SINERA can efficiently predict the tags for new ingredient expressions.

Fig. 3 The SINERA architecture. SINERA leverages multiple data views, including word-level \(\mathcal {X}_w\), character-level \(\mathcal {X}_r\), and POS-level \(\mathcal {X}_b\) perspectives, through FastText \(f_w\), GRU \(f_r\), and Flair POS models. These views are integrated within a decoder-only transformer \(f_{D}(., \theta _{D})\) to capture rich contextual information and enhance NER accuracy. The classifier \(f_{C}(., \theta _{C})\) generates final tag predictions, while the GMM model \(f_{A}(., \theta _{A})\), trained via Bayesian variational inference, aids learning from unlabeled data

For a given ingredient expression, the model extracts word-level features for each word using FastText (Bojanowski et al. 2017) to generate static word embeddings \(x^{[i,j]}_{w} = f_w(w^{[i,j]})\). FastText is chosen for its proficiency in handling out-of-vocabulary (OOV) words by incorporating sub-word level information, providing robust word-level perspectives on the samples. To capture intricate sequential dependencies and morphological structures at the character level within words, a Gated Recurrent Unit (GRU) layer is employed to produce character-level word views \(x_{r}^{[i,j]} = f_r(w^{[i,j]}; \theta _r)\). This allows the model to account for the character-level formulation of words. Furthermore, the model learns representations \(x_{b}^{[i,j]} = f_b(w^{[i,j]}; \theta _b)\) of part-of-speech (POS) tags (Akbik et al. 2018b) associated with words. These three data views (i.e., word level, character level, and POS level) are integrated as input into a decoder-only transformer model. This integration captures a multi-view perspective of the samples (Toutanova et al. 2003). The transformer model, with its self-attention mechanism, empowers our model to capture global dependencies within the input data, enhancing contextual understanding (Vaswani et al. 2017). The decoder generates the final embeddings fed to the classifier (\(f_c(.; \theta _c)\)) for word tag prediction. The classifier, a dense layer, maps the enriched multi-view representation to predicted classes effectively.

In an extension to learn from unlabeled data, we propose the use of Bayesian variational inference to learn a coarse partitioning of the words in ingredient expressions. The partition function takes the learned embedding representations of words as input and is modeled as an unsupervised method to learn a k-cluster assignment of the input. We will provide detailed explanations for each of these components in the subsequent sections.

4.3 Generation of multi-view embeddings

In the context of an ingredient expression \(\mathcal {S}^{[i]} = \{w^{[i,1]}, w^{[i,2]}, \cdots , w^{[i,M]}\}\) and their corresponding POS (Part of Speech) tags \(\mathcal {B}^{[i]} = \{b^{[i,1]}, b^{[i,2]}, \cdots , b^{[i,M]}\}\), our model is designed to create a tri-view embedding for each word within the expression. The first view is the word embedding, represented as \(\mathcal {X}_w\). To generate this embedding, we employ fastText (Bojanowski et al. 2017). We opt for fastText due to its exceptional generalization capabilities and its aforementioned resilience in handling external factors such as spelling errors and OOV words.

Given that recipes can be released by various authors and often lack review, the value of a robust model becomes evident. To further enhance the model’s generalization capabilities and robustness, we introduce character-level embeddings. Each word \(w^{[i,j]}\) is composed of multiple characters and is represented as a character sequence \(w^{[i,j]} = \{c^{[i,j]}_{1}, c^{[i,j]}_{2}, \cdots , c^{[i,j]}_{m}\}\), where m is the number of characters in word \(w^{[i,j]}\). The character embedding layer is a trainable component that undergoes end-to-end training alongside the model. We then employ a Gated Recurrent Unit (GRU) layer to produce a character-level embedding for each word in an ingredient expression (see Eq. 1).

$$\begin{aligned} \mathcal {P}&= \sigma _g (W^{\mathcal {P}} f(c^{[i,j]}_{\tau }) + U^{\mathcal {P}}h^{\tau -1} + b^{\mathcal {P}}),\\ r&= \sigma _g (W^rf(c^{[i,j]}_{\tau }) + U^rh^{\tau -1} + b^r),\\ \tilde{h}^{\tau }&= \sigma _{h^\prime } (Wf(c^{[i,j]}_{\tau })+ r\circ Uh^{\tau -1} + b),\\ h^\tau&= \mathcal {P} \circ \tilde{h}^{\tau } + (1-\mathcal {P}) \circ h^{\tau -1}, \end{aligned}$$
(1)

where \(\circ\) denotes element-wise multiplication, \(\sigma\) is a nonlinear activation function, and f(.) is the learned character embedding function. The trainable parameters are the weight matrices \(\{W, U\}\) and the bias terms b.
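As a minimal Keras sketch of the character-level encoder in Eq. (1), one can embed the characters of a word and keep the final GRU state as the character-level view \(x_{r}^{[i,j]}\); the vocabulary and dimension values below are assumptions, not the paper's configuration.

```python
import tensorflow as tf

# Character-level word view f_r: embed each character of a word, run a GRU over
# the character sequence, and keep the final hidden state as x_r^{[i,j]}.
# Vocabulary size and dimensions below are illustrative assumptions.
NUM_CHARS, CHAR_EMB_DIM, CHAR_HIDDEN = 128, 32, 64

char_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)           # characters of one word
char_emb = tf.keras.layers.Embedding(NUM_CHARS, CHAR_EMB_DIM, mask_zero=True)(char_ids)
x_r = tf.keras.layers.GRU(CHAR_HIDDEN)(char_emb)                    # final state = word view
char_encoder = tf.keras.Model(char_ids, x_r)
```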

The POS-tag embedding \(\mathcal {X}_b\) is obtained through a trainable randomly initialized embedding layer that is fine-tuned during the end-to-end training. The embeddings are then aggregated using a linear model and subsequently fed into the decoder layer to acquire an attention-based generalized embedding for each word. In the final step, the decoder generates \(\mathcal {H}\), which denotes the refined multi-view embedding for each word. This process can be summarized as:

$$\begin{aligned} \mathcal {X}_{g}^{[i]}&= f_G(\mathcal {X}^{[i]}_{w}, \mathcal {X}^{[i]}_{r}, \mathcal {X}^{[i]}_{b}; \theta _g), \\ \mathcal {H}&= f_D(\mathcal {X}_{g}^{[i]}; \theta _d). \end{aligned}$$
(2)

Here, \(\mathcal {X}^{[i]}_{r}\) represents the final character-level embedding for the words, \(\mathcal {X}^{[i]}_{b}\) signifies the POS embedding of the words, and \(\mathcal {X}^{[i]}_{w}\) corresponds to the input word embeddings obtained from the fastText model. To classify the target words, the embedding \(\mathcal {H}\) is passed to the classification network denoted as \(f_C(.; \theta _C)\), which is modeled as a single-layer neural network. This network is responsible for predicting the tag of each target word.
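The following sketch illustrates Eq. (2) end to end: the three views are concatenated and linearly aggregated (\(f_G\)), passed through a single self-attention block standing in for the decoder \(f_D\), and classified by a dense layer (\(f_C\)). The block count, head count, and dimensions are assumptions for illustration; the actual SINERA decoder may differ.

```python
import tensorflow as tf

# Sketch of Eq. (2): aggregate the three views (f_G), apply one self-attention
# block standing in for the decoder f_D, and classify each word (f_C).
# M = max words per line; D_W/D_R/D_B = view dimensions; K = number of tags.
M, D_W, D_R, D_B, D, K = 20, 300, 64, 32, 768, 7

x_w = tf.keras.Input(shape=(M, D_W))   # fastText word view
x_r = tf.keras.Input(shape=(M, D_R))   # character-level view (GRU output)
x_b = tf.keras.Input(shape=(M, D_B))   # POS-tag view

x_g = tf.keras.layers.Dense(D)(tf.keras.layers.Concatenate()([x_w, x_r, x_b]))  # f_G
attn = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=D // 8)(x_g, x_g)
h = tf.keras.layers.LayerNormalization()(x_g + attn)
ffn = tf.keras.layers.Dense(D, activation="relu")(h)
h = tf.keras.layers.LayerNormalization()(h + ffn)                                # f_D -> H
logits = tf.keras.layers.Dense(K)(h)                                             # f_C
tagger = tf.keras.Model([x_w, x_r, x_b], logits)
```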

4.3.1 Modeling coarse cluster distributions

To leverage both labeled and unlabeled data, we adopt a method that learns a coarse cluster distribution from the generalized multi-view word embedding, denoted as \(\mathcal {X}_{g}\), derived from \(f_G\). We assume that the learned embeddings \(x_g^{[i,j]}\) of words within a given ingredient expression follow a Gaussian mixture distribution with a predefined number of components denoted as \(\mathcal {C}\). In this study, we set \(\mathcal {C}\) to the number of classes. This choice of parameter is based on observed performance, leaving a detailed exploration of the optimal component size \(\mathcal {C}\) for this task for future research. The mixture distribution is parameterized by \(\beta = \{\pi , \mu , \sigma \}\), where \(\pi\) represents the mixture proportions, \(\mu = \{\mu _1, \mu _2, \cdots , \mu _{\mathcal {C}}\}\) signifies the means of the mixture components, and \(\sigma = \{\sigma _1, \sigma _2, \cdots , \sigma _{\mathcal {C}}\}\) corresponds to the variances of the mixture components. To model these parameters, we employ Variational Inference due to its ability to capture essential aspects of data generation processes while avoiding singularities (Akujuobi et al. 2020).

For our multivariate Gaussian distribution in d dimensions, we define the models as follows:

$$\begin{aligned} \mu _{k,d}&\sim \mathcal {N}(\delta _{k,d}, \gamma ^{2}_{k,d}) \quad 1 \le k \le \mathcal {C} \\ \sigma _{k,d}^{-2}&\sim \text {Gamma}(\kappa _{k,d}, \tau _{k,d}) \quad 1 \le k \le \mathcal {C} \\ \pi _k&\sim \text{Dirichlet}(\phi _k) \quad 1 \le k \le \mathcal {C} \\ q(\mu )&\sim \mathcal {N}(\mu _{k,d}, \sigma _{k,d}^{-2}) \\ q(\pi )&\sim \text{Categorical}(\pi ) \\ \end{aligned}$$

Here, \(\delta _{k,d}\) and \(\gamma ^{2}_{k,d}\) represent the mean and standard deviation of the component means’ variational posteriors, respectively. \(\phi\) indicates the concentration parameters of the Dirichlet variational posterior, while \(\kappa _{k,d}\) and \(\tau _{k,d}\) correspond to the shape and rate parameters of the Gamma variational posterior. The means are drawn from the approximate distribution \(q(\mu )\), while the probability of a data point being generated by any of the mixture components is modeled by \(q(\pi )\). Stochastic Variational Inference (Hoffman et al. 2013) is employed to approximate the variational parameters, following the “Bayes by Backprop” technique (Blundell et al. 2015).

To train the variational parameters, our goal is to find variational distribution variables \(\theta ^{*}\) that minimize the Kullback-Leibler (KL) divergence between the variational distribution \(q(\beta |\theta )\) and the true posterior distribution \(p(\beta |\mathcal {X}_g)\):

$$\begin{aligned} \mathcal {L}_{GM}(\mathcal {X}_{g}; \theta ^{*})&= {\arg \min }_{\theta } \frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{\mathcal {T}_i} KL (q_{i,j}(\beta |\theta )\,||\,p_{i,j}(\beta |x_{g}^{[i,j]})) \\&= {\arg \min }_{\theta } \frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{\mathcal {T}_i} \int q_{i,j}(\beta |\theta ) \log \frac{q_{i,j}(\beta |\theta )}{p_{i,j}(\beta )\,p_{i,j}(x_{g}^{[i,j]}|\beta )} \, d\beta \\&= {\arg \min }_{\theta } \frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{\mathcal {T}_i} \Big [ KL (q_{i,j}(\beta |\theta ) \,||\, p_{i,j}(\beta )) - \mathbb {E}_{q_{i,j}(\beta |\theta )}[\log p_{i,j}(x^{[i,j]}_{g}|\beta )] \Big ]. \end{aligned}$$

This resulting cost function is known as the Evidence Lower Bound (ELBO), which consists of the data-dependent likelihood cost and the prior-dependent complexity cost. The ELBO term represents the likelihood of the generated data \(\mathcal {X}_g\) fitting a mixture Gaussian with parameter \(\beta\). The cost is optimized using stochastic gradient descent.
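For intuition only, the coarse clustering can be emulated offline by fitting a variational Bayesian GMM to a frozen matrix of word embeddings and reading off soft component responsibilities, e.g., with scikit-learn's BayesianGaussianMixture. This is a rough stand-in: in SINERAS the variational parameters are optimized jointly with the network via stochastic variational inference rather than with a separate offline fit.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Offline stand-in for the coarse clustering: fit a variational Bayesian GMM to
# a frozen matrix of word embeddings X_g (num_words x d) and read off soft
# component responsibilities. SINERAS instead optimizes the variational
# parameters jointly with the network.
C = 7                                    # number of components (= number of tag classes)
X_g = np.random.randn(1000, 768)         # placeholder for learned embeddings x_g^{[i,j]}

bgmm = BayesianGaussianMixture(n_components=C, covariance_type="diag", max_iter=200)
bgmm.fit(X_g)
responsibilities = bgmm.predict_proba(X_g)    # soft cluster assignments per word
components = responsibilities.argmax(axis=1)  # hard assignments used for contrastive pairs
```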

4.3.2 Utilizing coarse clustering for representation learning

We introduce a regularization step into the representation learning process using a contrastive loss to leverage the acquired coarse data clustering provided by the Gaussian Mixture Model (GMM). Our goal is to maximize the agreement between words that share similar mixture components. Specifically, we employ the Normalized Temperature-Scaled Cross-Entropy loss (NT-Xent) with the cluster assignments obtained from the GMM model. The contrastive loss function is defined as:

$$\begin{aligned} \mathcal {L}_{XE}(x_i, x_j) = - \log \frac{\exp (\text{sim}(x_{i},x_{j})/ \eta )}{\sum _{k=1}^{2N} \mathbb {1}_{k \ne i} \exp (\text{sim}(x_{i},x_{k})/ \eta )}, \end{aligned}$$
(3)

Here, \(\eta\) represents the temperature parameter, and \(\mathbb{1}_{k \ne i}\) evaluates as true if and only if \(k \ne i\). We employ angular similarity, \(\textrm{sim}(x_i, x_j)=1-\arccos (\cos (x_i, x_j))/\pi\), as the similarity score. \(\cos (x_i, x_j)\) is the cosine similarity between \(x_i\) and \(x_j\). This loss is computed across all words and mixture components. Following this approach, when a positive word is identified within a specific component, we consider all other words in the batch associated with different mixture components as negative samples for contrastive loss computation.
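A plain NumPy sketch of Eq. (3) with the angular similarity and GMM-derived components is given below; it treats words sharing a component as positives and all other words in the batch as negatives, and is meant only to make the loss concrete rather than to reproduce the training implementation.

```python
import numpy as np

def angular_sim(a, b):
    """sim(x_i, x_j) = 1 - arccos(cos(x_i, x_j)) / pi."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def nt_xent(embeddings, components, eta=0.1):
    """NT-Xent over a batch of word embeddings: words sharing a GMM component
    form positive pairs; every other word in the batch acts as a negative."""
    n = len(embeddings)
    sim = np.array([[angular_sim(embeddings[i], embeddings[j]) for j in range(n)]
                    for i in range(n)])
    exp_sim = np.exp(sim / eta)
    losses = []
    for i in range(n):
        denom = exp_sim[i].sum() - exp_sim[i, i]           # indicator k != i
        for j in range(n):
            if j != i and components[i] == components[j]:  # positive pair
                losses.append(-np.log(exp_sim[i, j] / denom))
    return float(np.mean(losses)) if losses else 0.0
```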

5 Supervised learning loss

To train the decoder and classifier parameters \(\theta _d\) and \(\theta _c\), we employ a dynamically weighted cross-entropy (CE) loss. Specifically, for labeled data samples, the CE loss is calculated as follows:

$$\begin{aligned} \alpha _{k}&= \log \left( \frac{\max _{1 \le k' \le K} n_{k'}}{n_k}\right) + 1 \\ p^{[i,j]}_{k}&= P(t_{k}^{[i,j]}|h^{[i,j]};\theta )\\ \mathcal {L}_{CE}(\mathcal {H}; \theta )&= -\frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{M}\sum _{k=1}^{K} \alpha _{k}^{1-p^{[i,j]}_{k}} t_{k}^{[i,j]} \log p_{k}^{[i,j]} \end{aligned}$$
(4)

In this context, N represents the total count of tagged ingredient expressions, M corresponds to the number of words within an expression, and K is the total number of unique tags; \(n_k\) denotes the number of training instances carrying tag k. The ground truth label for the ith sentence, the jth entity, and the kth class is symbolically represented as \(t_{k}^{[i,j]}\), with a binary assignment of 1 denoting that the word’s tag is k, and 0 indicating otherwise. The term \(P(t_{k}^{[i,j]}|h^{[i,j]};\theta )\) represents the model’s predicted probability that the token \(h^{[i,j]}\) belongs to the kth class \(t_{k}^{[i,j]}\). The external summation computes the average across all samples in \(\mathcal {H}\).

The dynamic weights \(\alpha _{k}\) address the class imbalance issue by penalizing misclassification errors on minority classes more heavily than on the majority class. We initialize all weights to 1 and adjust them as proposed in (Fernando and Tsokos 2021). The logarithm smooths the weights for extremely imbalanced classes.

This weighting strategy adapts the contribution of misclassification on each label instance by applying class-wise weights of varying magnitudes from the same label based on the prediction output. This approach enables the model to distinguish between easy and challenging sample instances. By minimizing the CE loss, the model learns an optimal set of parameters \(\theta\) through supervised learning.
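The dynamically weighted CE of Eq. (4) can be written compactly as follows; class_counts would be the per-tag frequencies \(n_k\) from the training data, and the function names are illustrative rather than part of the released implementation.

```python
import numpy as np

def class_weights(class_counts):
    """alpha_k = log(max_k' n_k' / n_k) + 1: rarer tags receive larger weights."""
    counts = np.asarray(class_counts, dtype=float)
    return np.log(counts.max() / counts) + 1.0

def weighted_ce(probs, targets, alpha, eps=1e-12):
    """Dynamically weighted CE of Eq. (4) for a batch of words.
    probs, targets: (num_words, K) predicted probabilities and one-hot tags."""
    # The exponent (1 - p) down-weights already well-classified words, focusing
    # the loss on hard instances and minority classes.
    weights = alpha[None, :] ** (1.0 - probs)
    return float(-(weights * targets * np.log(probs + eps)).sum(axis=1).mean())
```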

5.1 Learning with augmentation

In our efforts to enhance the model’s ability to learn from diverse expressions of ingredients, we introduce a label regularization loss function as follows:

$$\begin{aligned} \mathcal {L}_{RE}(p(h^{[i, j]}), p(\overline{h^{[i, j]}})) = \frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{M} KL (p(h^{[i, j]})||p(\overline{h^{[i, j]}})) \end{aligned}$$
(5)

Here, h and \(\overline{h}\) represent the embeddings of the input ingredient expression and its reconstructed variant, respectively. The variable \(p(h^{[i, j]})\) denotes the probability distribution predicted for word \(w^{[i,j]}\). The variables N and M correspond to the number of ingredient expressions and the number of words per expression, respectively. This regularization loss encourages the tag predictions for an ingredient expression and its corresponding syntactically correct modification to be similar, promoting the model’s robustness to variations in ingredient expression syntax.
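A minimal sketch of Eq. (5) is shown below; it assumes that the word-level alignment between the original line and its SRM-manipulated variant is tracked during augmentation so that the two probability matrices are row-aligned.

```python
import numpy as np

def label_consistency_kl(p_orig, p_aug, eps=1e-12):
    """Eq. (5) sketch: mean KL divergence between tag distributions predicted
    for an ingredient line and its SRM-manipulated variant.
    p_orig, p_aug: (num_words, K) row-aligned probability matrices."""
    kl = (p_orig * (np.log(p_orig + eps) - np.log(p_aug + eps))).sum(axis=1)
    return float(kl.mean())
```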

5.2 Diffusion-based label consistency enhancement

In order to enhance stability and robustness, we introduce a regularization loss based on diffusion models and parameterized Markov chains. This mechanism prevents the prediction model from deviating significantly from the domain knowledge of the task. It is particularly valuable for smaller datasets, where models may not capture the sequential dependencies of words effectively.

We draw inspiration from previous works on diffusion models (Sohl-Dickstein et al. 2015; Hoogeboom et al. 2021; Song et al. 2020; Austin et al. 2021), which consider various domains, and apply it to our task of ingredient NER. We represent forward transition probabilities using a matrix \(Q_\tau\) for sequence position \(\tau\). However, we assume a uniform transition rule that applies to all sequence points \(\tau\) to avoid the spurious correlation since ingredient expressions have varying lengths and syntaxes. This results in an overall transition matrix \(\bar{Q} = \frac{1}{M}\sum _{\tau =0}^{M} Q_\tau\), obtained by consolidating the individual \(Q_\tau\) matrices. This transition matrix is calculated from only the labeled training data.

We introduce a stabilization objective that encourages accurate word predictions at each corresponding sequence position. The regularization loss can be defined as follows:

$$\begin{aligned} q_{\tau }&= \sigma (p(\hat{t}^{[i,j-1]}| h^{[i,j-1]}) \times \bar{Q})\\ \mathcal {L}_{DF}(\hat{\mathcal {T}})&= \frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{M} KL[q_{\tau }|| p(\hat{t}^{[i,j]}| h^{[i,j]})], \end{aligned}$$
(6)

where \(p(\hat{t}^{[i,j]}| h^{[i,j]})\) represents the model’s prediction probability distribution on the sequence word embedding \(h^{[i,j]}\) and \(\sigma\) is a softmax function. The variable \(q_{\tau }\) is the diffusion-based expected tag probability distribution. Incorporating this loss during training improves prediction stability and performance.
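The two ingredients of Eq. (6), the averaged transition matrix \(\bar{Q}\) estimated from labeled tag sequences and the KL consistency term, can be sketched as follows. The transition matrix here is estimated by pooling adjacent-tag counts over all positions, which is a simplification of averaging the per-position matrices \(Q_\tau\).

```python
import numpy as np

def tag_transition_matrix(tag_sequences, K):
    """Estimate an overall tag-to-tag transition matrix Q_bar from labeled data
    by pooling adjacent-tag counts over all positions and row-normalizing."""
    Q = np.zeros((K, K))
    for tags in tag_sequences:                     # tags: list of integer tag ids
        for prev, nxt in zip(tags[:-1], tags[1:]):
            Q[prev, nxt] += 1.0
    return Q / np.maximum(Q.sum(axis=1, keepdims=True), 1.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diffusion_consistency(probs, Q_bar, eps=1e-12):
    """Eq. (6) sketch for one expression: push the prediction at position j
    toward the prediction at j-1 diffused through Q_bar.
    probs: (num_words, K) predicted tag distributions."""
    if len(probs) < 2:
        return 0.0
    q = softmax(probs[:-1] @ Q_bar)                # expected next-tag distributions
    p = probs[1:]
    return float((q * (np.log(q + eps) - np.log(p + eps))).sum(axis=1).mean())
```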

5.3 Parameter learning

To train the networks \(f_G({.;\theta _G})\), \(f_D({.;\theta _D})\), \(f_A({.;\theta _A})\), \(f_C({.;\theta _C})\) for tag prediction, we jointly optimize the following losses using the Adam optimizer:

$$\begin{aligned} \mathcal {L} = \Phi \mathcal {L}_{CE} + \omega \mathcal {L}_{RE} + \varpi \mathcal {L}_{DF} + \zeta \mathcal {L}_{XE} + \lambda \mathcal {L}_{GM} \end{aligned}$$
(7)

Here, \(\mathcal {L}_{CE}\) represents the classification loss discussed in Sect. 5, \(\mathcal {L}_{RE}\) is the label augmentation regularization loss discussed in Sect. 5.1, \(\mathcal {L}_{DF}\) is the diffusion loss discussed in Sect. 5.2, \(\mathcal {L}_{XE}\) is the contrastive regularizer from Sect. 4.3.2, and \(\mathcal {L}_{GM}\) is the mixture model loss as detailed in Sect. 4.3.1. In this study, the model is trained end-to-end utilizing a combined loss function, incorporating the parameters \(\Phi\), \(\zeta\), \(\varpi\), \(\omega\), and \(\lambda\) as loss coefficients. We set the values of the loss coefficients through a grid search across a spectrum of values, specifically from the set \(\{1e^{-3}, 1e^{-2}, 1e^{-1}, 1\}\).

6 Experiments

We first present the datasets used, the experimental setting, the quantitative evaluation results with comparisons against baseline methods, and the parameter sensitivity analysis.

6.1 Dataset

6.1.1 TASTEset

The TASTEset dataset was initially introduced by Wróblewska et al. (2022). The dataset comprises 2935 ingredient lines annotated with 9 tag categories. However, during our preprocessing and a detailed review of the dataset, we identified certain labeling issues that required correction. Given the dataset’s manageable size, we conducted a manual review and carried out the following revisions:

  • We merged the tag “PART” into the “FOOD” category, as the “PART” tag was often used to describe parts of ingredient names.

  • We manually rectified several mislabeled tags, ensuring they were accurately assigned to the appropriate categories.

Table 2 presents the updated statistics for the revised dataset following our preprocessing and revision efforts.

Table 2 Dataset statistics of the TASTEset dataset

6.1.2 ARTI dataset

The ARTI (Augmented Recipe Tagged Ingredient) dataset is a structured compilation of recipe ingredient expressions categorized into various tags. To create this dataset, we amalgamated data from multiple sources, incorporating manual refinements for a more precise ingredient expression dataset. In addition to the tagged dataset, we integrated a set of untagged ingredient expressions from the RecipeNLG dataset (Bień et al. 2020).

Inspired by the application of TF-IDF (term frequency-inverse document frequency) scores in document summarization and sentence ranking (Christian et al. 2016; Manjari et al. 2020), we developed a quality assessment metric for ingredient expressions, denoted as \(\psi (\mathcal {S}^{[i]})\):

$$\begin{aligned} \psi (\mathcal {S}^{[i]}) = {\left\{ \begin{array}{ll} \frac{1}{M} \sum _{j}^{M} \text {TF-IDF}(w^{[i,j]}, \mathcal {S}^{[i]}) &{} \quad \text {if}\, \mathbb{1}(w^{[i,j]}) = \text {False} \;\; \forall j \\ 0 &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

Here, \(\mathbb{1}(w^{[i,j]})\) is a boolean function that verifies whether the word \(w^{[i,j]}\) is a single-letter abbreviation. We then apply a threshold to select high-quality entries. For both tagged and untagged data, we employ a threshold of \(\frac{1}{N}\sum _{\mathcal {S}^{[i]} \in \mathcal {S}} \psi (\mathcal {S}^{[i]})\). After preprocessing and eliminating duplicates, we obtained a dataset of 70,043 tagged and 1,208,484 untagged recipe ingredient expressions.
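A rough approximation of the quality filter in Eq. (8) using scikit-learn's TfidfVectorizer is sketched below; the single-letter-abbreviation test and the whitespace tokenization are simplifying assumptions, not the exact preprocessing pipeline used to build ARTI.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def quality_scores(lines):
    """Approximate Eq. (8): score each ingredient line by the mean TF-IDF of its
    words; lines containing a single-letter abbreviation score 0."""
    vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=True)
    tfidf = vectorizer.fit_transform(lines)                       # (num_lines, vocab)
    mean_tfidf = np.asarray(tfidf.sum(axis=1)).ravel()
    mean_tfidf /= np.array([max(len(line.split()), 1) for line in lines])
    has_abbrev = np.array([any(len(w) == 1 and w.isalpha() for w in line.split())
                           for line in lines])
    return np.where(has_abbrev, 0.0, mean_tfidf)

def keep_high_quality(lines):
    """Keep only lines whose score exceeds the dataset-mean threshold."""
    scores = quality_scores(lines)
    return [line for line, s in zip(lines, scores) if s > scores.mean()]
```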

In the tagged data, we standardize the tags by defining seven categories (see Table 3) to specify the components of ingredient expressions. These predefined tags align with previous works (Shi et al. 2022). As illustrated in Tables 2 and 3, some entities within the corpus occur more frequently than others, indicating a class imbalance issue. This disparity can lead to challenges, necessitating the incorporation of techniques to mitigate its effects when developing an Ingredient NER model.

Table 3 Dataset statistics of the ARTI dataset

Furthermore, we introduce Parts of Speech (POS) tags for ingredient lines, providing additional contextual information to the model. POS tags encompass categories such as verbs, nouns, and adjectives, and have proven beneficial in NER tasks (Shi et al. 2022; Eftimov et al. 2015; Diwan et al. 2020). We utilize the Flair POS tagger (Akbik et al. 2019) for generating POS tags due to its superior performance relative to other mainstream taggers.

Given this tagged data, we partition the dataset into various ratios to explore the influence of training sample size on the model’s performance. Notably, while unlabeled data is abundant, certain datasets are constrained by limited labeled data. This underscores the need for a semi-supervised approach to harness and learn from unlabeled data. Additionally, it is important to note that the corpus exhibits noise, including spelling errors and irregular positioning of ingredient names, quantities, and comments. These characteristics emphasize the requirement for a versatile model capable of comprehending diverse ingredient expression configurations and resilient to noise and outliers.

6.2 Comparison methods

In order to assess the effectiveness of our proposed model and associated strategies, we conducted a comparative evaluation with current SOTA techniques. The baseline models considered in this evaluation encompass:

  • FT-LR: A straightforward logistic regression model trained on fastText word embeddings.

  • biLSTM-CRF (Huang et al. 2015): This model employs a Bidirectional Long Short-Term Memory (LSTM) architecture to leverage past and future word input features for sequence tagging. It includes a conditional random field layer (Lafferty et al. 2001) to jointly model dependencies between adjacent tags during decoding.

  • FT-biLSTM-CRF: An extension of the biLSTM-CRF model, this version utilizes fastText embeddings.

  • IngredientParser (Shi et al. 2022): This model employs a self-attention framework on a recurrent model to parse ingredient expressions within recipes.

  • BERT (Devlin et al. 2018): BERT, a SOTA pretrained language model, is evaluated in its BERT-base configuration, which includes 12 attention heads and 12 layers. For this assessment, we utilize the Bert-tokenizer as employed in (Wróblewska et al. 2022) and evaluate the model’s tag prediction for word tokens.

These baseline models serve as benchmarks for assessing the performance of our proposed model and strategies in the context of our research.

6.3 Experimental setup

Our study assesses the performance of the proposed model on two distinct datasets: ARTI and TASTEset. In our experiments involving the SINERA model, we set the latent dimension d to 768. We conduct a search over the learning rate parameter lr with values \(5e^{-5}\) and \(2e^{-5}\). For each dataset, we utilize \(10\%\) of the train set as a validation set, and select the best model based on its performance on the validation set. We report the results of the best parameter configuration. We evaluate the proposed model against publicly available baseline models to ensure fair evaluations, making minor adjustments to adapt them to our specific task and input data. We fix the maximum number of training epochs for the proposed model at 20 for the TASTEset dataset and 10 for the ARTI dataset. To ensure robustness and reliability, each experiment is repeated five times with reshuffling, and the average results are reported.

For the BERT-base model obtained from Hugging Face, we adopt a dropout rate of 0.2 and train for 20 epochs on this task. We utilize the AdamW optimizer with a learning rate of \(2e^{-5}\) and a weight decay of 0.01, maintaining default settings for other parameters.

In the case of IngredientParser, we employ a latent dimension of 300, 10 attention heads, a dropout rate of 0.3, and 8 layers. The training of the IngredientParser model spans 100 epochs, using the Adam optimizer with a learning rate of \(5e^{-5}\).

BiLSTM and FT-BiLSTM are configured with a latent size of 300, a learning rate of 0.05, and utilize the AdamW optimizer with a weight decay of 0.01.

FT-LR follows the default parameters from the scikit-learn package.

All model implementations are carried out in the Python programming language, and training is conducted on a single NVIDIA A100 GPU. Additionally, we define the maximum ingredient expression word length as 20 and set the maximum number of characters within a word to 20. We implement the model using TensorFlow 2.14. Due to the varying length of ingredient line inputs, we pad sequences to a fixed length. We utilize a pretrained FastText model to generate input word-level embeddings \(x_{w}^{[i,j]}\) of dimension 300. The POS and character embeddings are randomly initialized and subsequently learned during model training.

6.3.1 Evaluation metrics

In this study, we assess the efficacy of the models by employing the F1 score and overall accuracy score as the performance metrics. The F1 score is defined as the harmonic mean of precision and recall, represented by the formula:

$$\begin{aligned} F1 = \frac{2 \cdot \text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}}. \end{aligned}$$
(9)

Here, precision denotes the ratio of true positive predictions to the total predicted positives, while recall signifies the ratio of true positive predictions to the total actual positives. The macro F1 score is calculated by computing the F1 score (the harmonic mean of precision and recall) for each class and then averaging across all classes. This macro F1 score is useful for evaluating the overall performance of a classifier across all classes, without bias towards the majority class.

The accuracy score measures the ratio of correctly predicted observations to the total number of observations:

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(10)

where TP (True Positives) are the instances that are correctly predicted as positive. TN (True Negatives) are the instances that are correctly predicted as negative. FP (False Positives) are the instances that are incorrectly predicted as positive when they are actually negative. FN (False Negatives) are the instances that are incorrectly predicted as negative when they are actually positive. Our emphasis on positive predictions in the evaluations is driven by our preference for models that correctly infer the ingredient tags.

6.4 Quantitative study

We show and analyze the performance of the supervised and semi-supervised variants of the model. We evaluate the model performance based on the percentage of data used for training. It is worth pointing out that, even though the training percentage varies, the test set is the same across all setups and method evaluations. We first split the tagged data with the 70:30 ratio, where 70% is used for training and 30% as the test set. We then vary the ratio of the train set to use for training. Specifically, when referring to a training ratio of 6%, this denotes the selection of 6% of the full dataset for training purposes. This is akin to a data distribution ratio of 6:64:30, where 64% of the data remains either unutilized for supervised learning or is added to the untagged set for SSL. Therefore, a training ratio of 70% means we use the whole labeled train set for supervised learning. In the semi-supervised setting, we randomly select data samples from the untagged data. Furthermore, we assess the model’s efficacy on an adapted test dataset. To create this test data, we apply the SRM modification to the initial test set, thereby emulating diverse ingredient expression input syntaxes encountered in real-world scenarios (refer to Sect. 3 for details). For easy comparison of the reported results in Tables 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13, bold font denotes the best results obtained by the SINERA(S) models, while blue font shows the best results obtained across the baseline models. The scores reported for the categories show the F1 score obtained by evaluating the performance of each category independently.

Table 4 Performance result on the TASTEset test dataset with 30% training dataset. (Color table online)
Table 5 Performance result on the TASTEset test dataset with 70% training dataset. (Color table online)
Table 6 Performance result on the TASTEset adapted test dataset with 30% training dataset. (Color table online)
Table 7 Performance result on the TASTEset adapted test dataset with 70% training dataset. (Color table online)
Table 8 Performance on the ARTI test dataset with 6% training dataset. (Color table online)
Table 9 Performance on the ARTI adapted test dataset with 6% training dataset. (Color table online)
Table 10 Performance on the ARTI test dataset with 30% training dataset. (Color table online)
Table 11 Performance on the ARTI adapted test dataset with 30% training dataset. (Color table online)
Table 12 Performance result on the ARTI test dataset with 70% training dataset. (Color table online)
Table 13 Performance on the ARTI adapted test dataset with 70% training dataset. (Color table online)

Performance assessment on the TASTEset dataset

Tables 4 to 7 present an overview of the model performance on the TASTEset dataset, trained with \(30\%\) and \(70\%\) of the data. Notably, the logistic regression (LR) model exhibited superior performance over several SOTA sequence-based methods across various training data splits. However, as the training dataset size increased, the performance gap narrowed, as evidenced in Tables 5 and 7. The superior performance of the LR model over sequence-based models with smaller training datasets can be attributed to the limited sample size of the TASTEset \(30\%\) train sample, comprising approximately 880 samples. Such restricted data fails to capture a comprehensive view of the dataset’s structure, style, and sequential dependencies. Hence, the LR model, as a static learner reliant solely on word embeddings, effectively establishes decision boundaries with fewer samples.

In contrast, the other baseline methods are sequential learners aiming to capture the intricate structure of ingredient expressions, necessitating a larger dataset. BERT consistently outperformed all other baseline methods across all settings, even with limited training data. BERT’s prowess stems from its extensive pre-training on vast datasets, affording a distinct advantage compared to models initialized from scratch, particularly in scenarios where training data is limited.

Our proposed SINERA(S) model surpassed SOTA baseline methods. It exhibits greater stability than other neural sequence-based models and outperforms the LR model. This enhanced performance can be attributed to the model’s increased flexibility, achieved through multi-view embedding and model consistency training.

The semi-supervised variant highlights the benefits of learning from unlabeled data, as explored further in Sect. 6.4.2. The semi-supervised strategy leverages regularization mechanisms to enrich the classifier model’s performance with structural information inherent in the unlabeled data.

When the model is trained in an augmented setting using the strategy outlined in Sect. 5.1, we observe similar or improved performance across all models on the adapted test data. A notable performance decrease on both the test and adapted data is observed for the SINERA model when trained on 30% of the data. This drop can be attributed to the limited training data. The SINERAS model, by contrast, displays stable performance, as auxiliary information is captured through the semi-supervised framework. The introduced augmentation strategy enhances model robustness to various ingredient expression syntaxes, as evident in Tables 6 and 7, where the models are evaluated on adapted test data and show improved performance across the board. In this study, a rigorous statistical analysis was also employed to compare multiple models using the McNemar test (McNemar 1947), adjusted with the Bonferroni correction method (Dunn 1961). The evaluation focused on the TASTEset dataset, and the pairwise comparisons revealed that the performance of the IngredientParser and FT-BILSTM-CRF models was statistically similar to that of logistic regression, with corresponding p-values of 0.5 and 0.6, respectively. Furthermore, examination of the TASTEset results indicated that the BERT and SINERA models exhibited statistically similar performance, as evidenced by a p-value of 0.2, above the common significance threshold of 0.05.
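The pairwise testing procedure can be sketched as follows. The sketch assumes per-token correctness indicators for two models are available as boolean arrays; the contingency-table construction and the use of statsmodels are illustrative choices.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def pairwise_mcnemar(correct_a, correct_b, n_comparisons, alpha=0.05):
    """McNemar test on two models' per-token correctness, with a
    Bonferroni-adjusted significance threshold for multiple comparisons."""
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    # 2x2 table: [both correct, only A correct; only B correct, both wrong]
    table = [[np.sum(a & b),  np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    result = mcnemar(table, exact=False, correction=True)
    significant = result.pvalue < alpha / n_comparisons
    return result.pvalue, significant
```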

Performance assessment on the ARTI dataset

Utilizing the same evaluation approach as employed with the TASTEset dataset, Tables 8 to 13 display model performance when trained on data splits of \(6\%\), \(30\%\), and \(70\%\). Increased training sample sizes resulted in enhanced performance of sequence-based methods compared to logistic regression on the test data. BERT consistently outperformed other baseline methods, a trend attributed to the advanced capabilities of large language models (LLMs) as discussed in Sect. 6.4.

The second-best-performing baseline method was IngredientParser, demonstrating notable superiority over other baselines. This success underscores IngredientParser’s strength when learning from sequential data where word position is significant.

All models demonstrate enhanced performance when evaluated on the ARTI dataset, showcasing increased robustness and expressiveness. This highlights the models’ potential, especially with larger training datasets. Nonetheless, not all models experienced a significant boost in performance with increased training data, indicating saturation with respect to training data size. The number of training samples needed to reach this performance ceiling varies with problem complexity and data representation. Our findings suggest that, for ingredient NER, a diverse dataset covering a range of ingredients and definitions is crucial.

As discussed in Sect. 6.4, training with data augmentation enhances model robustness. When evaluating the test data samples, the models’ performance closely aligns with that of models trained without data augmentation, with performance improvements observed in some instances. This underscores the models’ stability with the incorporation of augmented data. The assessment on the adapted test data demonstrates enhanced performance, underscoring the benefits introduced by data augmentation during training. Particularly noteworthy is the performance on the adapted test set under conventional training, where the IngredientParser outperformed the BERT model on the \(30\%\) and \(70\%\) train data splits. This heightened performance of the IngredientParser can be attributed to the additional information derived from the inclusion of Part-of-Speech (POS) tags, a feature that is not leveraged by the BERT model. However, after training with data augmentation, the BERT model reclaims its superiority over the baseline methods. In our evaluation, we again utilized the McNemar test, as discussed in Sect. 6.4, to compare multiple models. The pairwise comparisons showed that none of the models exhibited statistically similar performance with respect to the ARTI dataset. This observation can be attributed to the larger dataset, which exposes a broader range of variations.

6.4.1 Visual analysis

This section conducts an analysis of class similarities from the model’s perspective and explores the enhanced robustness achieved through semi-supervised training using unlabeled data. For clarity, here we focus on the semi-supervised SINERA model trained on the TASTEset dataset with \(30\%\) train data.

Analysis of label similarities and model robustness The heatmap in Fig. 4 visualizes the model’s performance, with columns corresponding to the entity tags of words in ingredient expressions. Notably, the semi-supervised model demonstrates a slight advantage in prediction confidence and performance compared to the supervised method. This improvement in predictive confidence is prominently illustrated by the intensified coloration in Fig. 4b in contrast to Fig. 4a. The improved performance is attributed to the model’s heightened discriminative capabilities when auxiliary information from unlabeled data is incorporated into the learning process.
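A heatmap of this kind can be produced directly from per-class prediction confidences; the sketch below assumes these confidences have already been aggregated into a matrix, and the plotting choices are illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_confidence_heatmap(matrix, labels, title):
    # matrix[i][j]: mean predicted probability of tag j for tokens whose true tag is i.
    ax = sns.heatmap(matrix, xticklabels=labels, yticklabels=labels,
                     cmap="Blues", annot=True, fmt=".2f")
    ax.set_xlabel("Predicted tag")
    ax.set_ylabel("True tag")
    ax.set_title(title)
    plt.tight_layout()
    plt.show()
```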

Fig. 4 Confusion plot of the model prediction, showing the prediction confidence for each label

Furthermore, we investigate the similarity between tag labels as perceived by the model. The model occasionally misclassifies tags such as Quality, Taste, and Purpose as Food. We hypothesize that this misclassification stems from data labeling: words associated with these tags are often intertwined with ingredient names. For example, the word sweet in sweet potato can be labeled as part of the ingredient name, as a physical quality of the ingredient, or as Taste. A similar situation arises with the vegetable bitter leaf. In real-world scenarios, tag syntaxes can vary, leading the model to struggle with such edge cases. Indeed, we identified instances where the same word carried different tags in different samples of ingredient expressions.

Analysis using GMM in the semi-supervised approach We proceed to analyze the proposed semi-supervised approach based on the Gaussian Mixture Model (GMM). Visualizations of word embeddings with the corresponding true tags and the mixture components inferred by the GMM-based auxiliary model offer insights into the learned auxiliary components. We utilize the t-SNE method (Van der Maaten and Hinton 2008) for plotting; for clarity, we select a random set of 1000 ingredient words and plot their embeddings. Our study explores how components are allocated to distinct words within ingredient expressions and how these allocations compare to the true tag structure.
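A minimal sketch of this visualization is given below; it assumes the word embeddings, the inferred component ids, and integer-encoded true tags are already available as arrays, and uses scikit-learn's t-SNE implementation as a stand-in for the plotting pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_components_vs_tags(embeddings, component_ids, tag_ids, sample=1000, seed=0):
    # Randomly sample words for readability, then project their embeddings to 2-D.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings), size=min(sample, len(embeddings)), replace=False)
    X2d = TSNE(n_components=2, random_state=seed).fit_transform(np.asarray(embeddings)[idx])

    fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))
    ax_a.scatter(X2d[:, 0], X2d[:, 1], c=np.asarray(component_ids)[idx], s=5, cmap="tab20")
    ax_a.set_title("(a) Inferred GMM component")
    ax_b.scatter(X2d[:, 0], X2d[:, 1], c=np.asarray(tag_ids)[idx], s=5, cmap="tab10")
    ax_b.set_title("(b) True tag")
    plt.tight_layout()
    plt.show()
```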

Figure 5 illustrates the model’s ability to learn the component structure of various words. By comparing the component assignments in Fig. 5a to the actual tags in Fig. 5b, we observe basic similarities between the assigned cluster arrangement and the true label structure. This suggests that the generated word segmentation can serve as valuable auxiliary information for the SINERA model, enhancing its predictive capabilities.

Fig. 5 Scatter plot showing the cluster assignment and the true class tag of the words in some sampled TASTEset dataset ingredient lines

6.4.2 Model component analysis

In Fig. 6, we present a comprehensive analysis of the model’s components. Our investigation examines the impact of the multi-view embedding, the auxiliary model, the weights, and the utilization of POS tags. Subsequently, we assess the model’s performance with and without the weighting scheme aimed at mitigating label imbalance effects. Our results confirm that combining the different model components yields substantial benefits to overall performance. Notably, the character-level embedding emerges as the most prominent contributor, surpassing the influence of POS tags. We attribute this to the character-level embedding’s unique capability to capture the intrinsic structural characteristics of word construction.
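As an illustration of the multi-view embedding, the PyTorch-style sketch below concatenates word-, character-, and POS-level views of each token; the layer sizes and the use of a character BiLSTM are illustrative assumptions rather than the exact SINERA architecture.

```python
import torch
import torch.nn as nn

class MultiViewEmbedding(nn.Module):
    """Concatenate word-, character-, and POS-level views of each token."""

    def __init__(self, n_words, n_chars, n_pos, w_dim=100, c_dim=25, p_dim=16):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.pos_emb = nn.Embedding(n_pos, p_dim)
        # A character BiLSTM summarises the internal structure of each word.
        self.char_lstm = nn.LSTM(c_dim, c_dim, bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids, pos_ids):
        # word_ids: (batch, seq), char_ids: (batch, seq, max_chars), pos_ids: (batch, seq)
        b, s, m = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, m, -1)
        _, (h, _) = self.char_lstm(chars)                # h: (2, b*s, c_dim)
        char_view = h.transpose(0, 1).reshape(b, s, -1)  # (b, s, 2*c_dim)
        return torch.cat([self.word_emb(word_ids), char_view, self.pos_emb(pos_ids)], dim=-1)
```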

Fig. 6 The model component analysis showing the contribution of the model component combination

6.4.3 Parameter sensitivity

The sensitivity of model performance to various parameters is systematically investigated in this section. Specifically, we assess the impact of the semi-supervised coefficient \(\alpha\) and the Gaussian Mixture Model (GMM) component size coefficient, denoted by \(\varrho\).

Semi-supervised coefficient (\(\alpha\)) sensitivity

We assess the effects of different values of \(\alpha\) on predictive performance. This sensitivity analysis reveals that the predictive performance remains relatively stable across varying \(\alpha\) values, especially in the presence of larger datasets. However, when working with smaller datasets, slight performance variations are observed.
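To make the role of \(\alpha\) concrete, it can be read as the weight placed on the unsupervised term in a combined objective of the generic form \(\mathcal {L} = \mathcal {L}_{\mathrm {sup}} + \alpha \, \mathcal {L}_{\mathrm {unsup}}\). This generic decomposition is stated here only for illustration; the precise terms are the supervised tagging loss and the semi-supervised regularization used by SINERAS.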

Number of GMM components (\(\mathcal {C}\)) sensitivity

This analysis pertains to the impact of the component size coefficient \(\varrho\) with respect to clustering. In this study, the tag-set size is scaled by the factor \(\varrho\) to determine the number of assumed mixture components. Consequently, the number of components in the Gaussian Mixture Model (GMM), denoted as \(\mathcal {C}\), is defined as \(\mathcal {C} = \varrho \times |\mathcal {T}|\), where \(|\mathcal {T}|\) is the cardinality of the tag set. Our observations, shown in Fig. 7, indicate that increasing the component coefficient beyond 2 results in a decline in performance.
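A minimal sketch of this scaling, using scikit-learn's GaussianMixture as a stand-in for the auxiliary model, is shown below; the embedding input format is an assumption.

```python
from sklearn.mixture import GaussianMixture

def fit_auxiliary_gmm(word_embeddings, tag_set, rho=2, seed=0):
    # Number of mixture components scales with the tag-set size: C = rho * |T|.
    n_components = rho * len(tag_set)
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(word_embeddings)              # word_embeddings: (n_words, embedding_dim)
    return gmm, gmm.predict(word_embeddings)
```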

Fig. 7 The component size analysis with different component coefficients

Number of training epochs sensitivity

Figure 8 shows the impact of the number of training epochs on model performance. Our analysis reveals that exceeding 10 epochs on the ARTI dataset and 20 epochs on the TASTEset dataset yields no significant performance increase. This analysis helps determine the optimal number of training epochs and prevents overfitting. We also observed that the semi-supervised variant requires fewer epochs than the supervised variant, which can be attributed to its ability to harness the intrinsic data structure present within the unlabeled data during learning.
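In practice, this observation translates into a simple early-stopping rule on a validation metric; the sketch below is a generic version in which the training and evaluation callbacks, the metric, and the patience value are illustrative assumptions.

```python
def train_with_early_stopping(train_epoch, evaluate, max_epochs=50, patience=3):
    """Stop training once the validation F1 stops improving.

    `train_epoch()` runs one training epoch; `evaluate()` returns the validation F1.
    """
    best_f1, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        f1 = evaluate()
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_f1, best_epoch
```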

Fig. 8 Epoch analysis showing the model performance with different parameters

6.4.4 Memory and computational efficiency analysis

We analyze the time and memory costs of the proposed model in comparison to the widely used BERT architecture. Figure 9 presents a comparison of the number of parameters in BERT and in our proposed model. Notably, BERT comprises a significantly larger parameter count than our proposed model. Despite this parameter reduction, we achieve performance superior or equivalent to BERT. Furthermore, examining the training time per epoch reveals the efficiency of the proposed model: BERT demands more training time than either variant of our proposed model. This observation underscores the temporal and computational efficiency gains attainable with our model.
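Both quantities can be measured with a few lines of PyTorch; the sketch below assumes the compared models expose a standard nn.Module interface and a per-epoch training function, which are illustrative assumptions.

```python
import time
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Total number of trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def time_one_epoch(train_epoch_fn) -> float:
    # Wall-clock training time for a single epoch, in seconds.
    start = time.perf_counter()
    train_epoch_fn()
    return time.perf_counter() - start
```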

Fig. 9 Parameter size and train time per epoch comparison between SINERA and BERT

6.5 Qualitative study

In order to evaluate and contrast the predictive quality of our model against BERT and IngredientParser, we present the model predictions for some sampled ingredient expressions in Table 14. The selection of these comparative models is based on their quantitative evaluation performance.

Table 14 Examples of model predictions to highlight the predictive quality of the SINERA model as compared with BERT and IngredientParser on sampled entities from TASTEset recipe dataset. (Color table online)

In Table 14, the SINERA rows represent the predictions generated by our proposed model for the given ingredient line, the IngredientParser rows represent predictions made by the IngredientParser model, and the BERT rows contain predictions made by the BERT model for the same ingredient expression. Notably, these selected samples encompass a range of lengths and syntax variations. A robust model should be capable of providing accurate predictions regardless of differences in ingredient line syntax or spelling errors. The second to fourth examples in the table illustrate situations where a word (e.g., light, half, and superfine) might be incorrectly tagged due to the model’s inability to discern the precise usage context.

However, we also observed instances where the model struggled to predict the correct tag for certain inputs, particularly when the word in question could be interpreted as part of the ingredient name. For example, in the phrase 3 baking chocolate bars, the term bar might also be considered a component of the ingredient name. When presented with these cases, human evaluators did not consistently agree on the definitive or “correct” tagging for some similar examples. This suggests that certain word tags within the dataset can be subjective, even among humans.

7 Conclusion

In the presented work, our primary focus is on addressing the issue of model stability and the effective utilization of unlabeled data in the context of the ingredient NER problem for food computing. Our approach involves several key components.

We introduce a novel dataset specifically tailored for ingredient NER, which serves as a foundation for our subsequent investigations. We discuss the spurious correlation problem prevalent in current ingredient NER models and present our insights and observations. We hypothesize that the issue can be mitigated by enhancing the existing training methodology through data augmentation. We then proceed to propose and rigorously assess a data augmentation training strategy, demonstrating its capacity to enhance predictive performance when integrated into various established models.

Furthermore, we present a novel SSL framework for ingredient tag predictions, which is grounded in the concept of learning Gaussian mixture components using variational inference. This framework is designed to unveil the inherent data structure and relationships within the dataset. Our evaluation, performed on a baseline dataset, substantiates that incorporating unlabeled data can further enhance the generalization and predictive capabilities of our proposed model.

For future work, an interesting research trajectory includes extending the proposed semi-supervised framework to other domains beyond ingredient NER. This versatile framework can be applied to a wide array of semi-supervised problems, offering exciting prospects for diverse applications, such as image, text, and graph/node classification. Furthermore, we intend to extend our augmentation methodology to address various NER tasks, while also investigating the influence of linguistic variations on the augmentation process. These future endeavors aim to push the boundaries of SSL and data augmentation techniques, contributing to the broader landscape of machine learning and natural language processing.