1 Introduction

Information extraction (IE) was defined by Moens [48] as the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, making the information more suitable for information processing tasks. The task of IE is to retrieve from text important information concerning named entities, time relations, noun phrases, semantic roles, or relations among entities [48]. This process often consists of two steps: (1) finding the extraction patterns and (2) extracting information with the use of these patterns. Three degrees of text structure can be distinguished [11]: free natural language text (free text, e.g. newspapers, books), semi-structured data in the XML or HTML format, and fully structured data, e.g. databases. The literature sometimes treats semi-structured data, like HTML, as a container of free natural language text. There are many methods for pattern creation [11, 63, 74], e.g. manual creation or the use of machine learning techniques.

This study briefly presents a general framework of an information extraction system (IES) and its implementation, the BigGrams system. Moreover, it describes (1) a novel proposal of a semi-supervised wrapper induction (WI) algorithm that utilizes a whole Internet domain (website, site, domain’s web pages) and creates patterns to extract information in the context of the entire Internet domain and (2) a novel taxonomic approach and its impact on the semi-supervised WI. The system and the WI were developed to support the information retrieval system called NEKST [17]. NEKST utilizes the structured results coming from the BigGrams system to improve query suggestions and the ranking algorithm of web pages. Thanks to the BigGrams system, relevant phrases (keywords) are extracted for each Internet domain.

The BigGrams system analyses HTML web pages to recognize and extract values of single or multiple attributes of an information system (IS) [52] describing, for example, films, cars, actors, pop stars, etc. The main aim of the WI, in turn, is to create a set of patterns. These patterns are matched to the data, and in this way new information is extracted and stored in the created IS. The proposed novel WI is based on the application of formal concept analysis (FCA) [54, 77] to create extraction patterns in a semi-supervised manner. Thanks to FCA, a hierarchy of character-sequence groups is created. These groups cover selected parts of the Internet domains. Based on this hierarchy, the proposed WI algorithm (1) selects the appropriate groups from the hierarchy, i.e. the groups that sufficiently cover and generalize the domain’s web pages, and (2) based on these groups, creates patterns that occur frequently in the Internet domain. The BigGrams system subsequently uses these patterns to extract information from semi-structured text documents (HTML documents). The proposed semi-supervised WI approach consists of the following steps: (1) the user defines a reference input set or a taxonomy of correct instances called seeds (values of the attributes of an IS), (2) the algorithm uses the seeds to build the extraction patterns, and (3) the patterns are subsequently used to extract new instances that extend the seed set [18, 73, 74]. For example, an input set of actor names Brad Pitt, Tom Hanks can then be extended with new instances, such as Bruce Willis, David Duchovny, Matt Damon.

The author did not find in the available literature methods similar to the proposed BigGrams that realize a deep semi-supervised approach to extracting information from given websites. There are shallow semi-supervised methods, such as the Dual Iterative Pattern Relation Extraction (DIPRE) technique [5] and the Set Expander For Any Language (SEAL) system [18, 73, 74], that obtain information from Internet web pages. These approaches use horizontal (shallow) scanning and processing of Internet web pages in order to extract appropriate new seeds and create an expanded global set of seeds. The aim of these systems is to expand the set of seeds with new seeds in the same category. In the end, these systems evaluate the global results, i.e. the quality of the extended global set of seeds. The proposed deep semi-supervised approach, as opposed to the shallow semi-supervised method, is based on vertical (deep) scanning and processing of entire Internet websites to extract information (relevant instances from these sites) and create an expanded local set of new seeds. In this approach, the number of proper seeds obtained from given websites is evaluated. In this article, the author shows empirically that the shallow semi-supervised approach (SEAL is established as the baseline method) is inadequate to resolve the problem of deep semi-supervised extraction, i.e. information extraction focused on websites. The shallow approaches cannot create all the required and correct patterns to extract all important and relevant new instances from given sites.

The main objectives of this study are as follows:

  • establish a good starting point to explore IESs and the proposed BigGrams system through the theoretical and practical description of the above systems,

  • briefly describe the novel WI algorithm with the use case and theoretical preliminaries,

  • establish the impact of (1) the input form (the seed set or the taxonomy of seeds), (2) the pre-processing of the domain’s web pages, (3) the matching techniques, and (4) the level of HTML document representation on the WI algorithm results,

  • find the best combination of the elements mentioned above to achieve the best results of the WI algorithm,

  • check what kind of requirements must be satisfied to use the proposed WI in an iterative way, i.e. in the boosting mode, where the output results are fed back to the system input.

The author has determined (based on empirical research) the best combination of the above-mentioned core information extraction elements and their impact on the information extraction results. The conducted research shows that the best results are achieved when the proposed taxonomy approach is used to represent the input seeds, the pre-processing technique that clears the values of HTML attributes is applied, the seeds are matched only between HTML tags, and the tags-level rather than the chars-level representation of HTML documents is used. Thanks to these findings, we can construct better WI algorithms producing better results. The proposed system and the WI method have been compared with the baseline SEAL system. Furthermore, the results of the conducted experiments show that we can use the output data (extracted information) as input data of the BigGrams system. This allows the system (when we can ensure well-diversified input data) to be used in an iterative manner.

The presented study is theoretically well grounded to give an in-depth understanding of the proposed method as well as to make it easily reproducible. The paper is structured as follows. Section 2 describes various known IE systems. Section 3 presents the formal description of the IES, i.e. it contains the description of the IS and the general framework of the IES. Section 4 describes the implementation of the system mentioned above. This section contains (1) the comparison of the baseline SEAL and the proposed BigGrams system, (2) the specification of the proposed BigGrams IES, and (3) the WI algorithm together with the historical and mathematical description of the FCA background, and a case study. The experimental results are presented in Sect. 5. Finally, Sect. 6 concludes the findings.

2 State of the art and related work

Currently, there are numerous well-known reviews describing IESs [53, 66]. Usually they focus on free text or semi-structured text documents [4, 48, 58]. Optionally, the reviewers describe one of the selected components, such as WI [11]. There are also many existing IESs. Typically, they are based on the distributional hypothesis (“Words that occur in the same contexts tend to have similar meanings”), and they use a formal computational approach [30, 38]. Researchers also constructed another hypothesis called the KnowItAll Hypothesis. According to this hypothesis, “Extractions drawn more frequently from distinct sentences in a corpus are more likely to be correct” [21]. IESs such as Never-Ending Language Learner (NELL), KnowItAll, TextRunner, or Snowball represent this approach [1, 3, 6, 9, 10, 22, 23, 56, 59, 68, 78]. The systems mentioned above represent the trend called open IE. They extract information from semi-structured text (HTML documents considered to be containers of natural language text) or natural language text. Also, there are solutions that attempt to induce ontologies from natural language text [20, 37, 49]. Examples of IE for semi-structured texts are described in [26, 32, 34, 50, 55, 61, 76]. In the case of databases [11], IE can be viewed as an element of data mining and knowledge discovery [45]. There are also many algorithms that implement the WI component of an IES [11].

Schulz et al. [60] present the newest survey of web data extraction aspects. Their paper describes and complements the most recent survey papers of authors like Ferrara et al. [24] or Sleiman and Corchuelo [61]. Furthermore, we can add three articles, by Varlamov and Turdakov [72], Umamageswari and Kalpana [71], and Chiticariu et al. [13], as a complement to the Schulz et al. survey. In these studies, the authors focus on the description of methods, vendors, and products for IE or WI. Furthermore, all the papers mentioned above describe the IE problem from different perspectives, such as the level of human intervention, limitations, wrapper types, and wrapper scope. From the author’s research point of view, the best division of IE approaches is based on the techniques used to learn the WI component. We may distinguish three techniques: supervised, semi-supervised, and unsupervised. The supervised methods require manual effort of the user, i.e. the user must devote some time to label the web pages and mark the information to be extracted [34,35,36, 64, 69]. The unsupervised methods, on the other hand, start from one or more unlabelled web documents and try to create patterns that extract as much prospective information as possible, and then the user gathers the relevant information from the results (definition taken from Sleiman [64]) [12, 15, 39, 43, 62, 63, 65].

The semi-supervised technique is an intermediate form between the supervised and unsupervised methods. In this approach, we only create a small data set of seeds (a few values of an IS attribute) rather than a data set of labelled pages. There are three well-known IESs that are based on the semi-supervised approach to WI and operate on web pages, i.e. the Dual Iterative Pattern Relation Extraction (DIPRE) technique [5], the Set Expander For Any Language (SEAL) system [18, 73, 74], and, similar to SEAL, the Set Expansion by Iterative Similarity Aggregation (SEISA) [31]. There are several advantages of these approaches, namely (1) they are language independent, (2) they can expand a small set of input instances (seeds) in an iterative way with sufficient precision, and (3) they discover the patterns with almost no human intervention. The SEAL represents a more general approach than DIPRE. The DIPRE extracts only information about books (titles and author names). The SEAL can extract unary (e.g. actor name(Bruce Willis)) and binary (e.g. born-in(Bruce Willis, Idar-Oberstein)) relations from HTML documents. In the first case, the extracted instance would be Bruce Willis, in the second Bruce Willis/Idar-Oberstein. Due to the more general form of the SEAL, as compared to the DIPRE, and thanks to its reproducibility, as compared to the SEISA, the author decided to compare the BigGrams system against the SEAL as a baseline.

Finally, it is worth mentioning one of the few obstacles in this field, which relates to the availability of well-labelled and large gold-standard data sets and tools for IE [60]. The author can confirm the observation of Schulz et al. [60] that it is difficult to compare the results of IE solutions. The promising changes in this area are the large and well-labelled data sets created by Bronzi et al. [7] and Hao et al. [28] (this set was used in the additional benchmark, see “Appendix A”), as well as the original data set created by the author (see Sect. 5.1).

3 Formal description of the information extraction system

Usually, an IES uses data that are received or have been transformed from the input of an IS. Also, the semi-supervised WI algorithms use information from some kind of IS. For this reason, the author assumes that it is important to formally define the term IS to better understand the rest of this article and its role in the IES. Section 3.1 describes the theoretical basis of the IS with the technical details. Section 3.2 explains the general framework of the IES.

3.1 Theoretical preliminaries

According to Pawlak [52], in each IS there can be identified a finite set of objects X and a finite set of attributes A. Each attribute a belonging to A is related to its collection of values \(V_a\), which is also known as the domain of attribute a. It is assumed that the domain of each attribute is at least a two-element set, i.e. each attribute has at least two possible values. Clearly, some attributes may have common values, e.g. for the attributes length and width the set of values is the real numbers. The binary (two-argument) function \(\varrho \) is introduced to describe the properties of the system objects. This function assigns the value v belonging to the domain \(V_a\) to each object \(x \in X\) and attribute \(a \in A\). An information system is a quadruple:

$$\begin{aligned} { IS} = {<}X, A, V, \varrho {>} \end{aligned}$$
(1)

where X is a non-empty and finite set of objects; A is a non-empty and finite set of attributes; \(V = \bigcup _{a \in A}{V_a}\), where \(V_a\) is the domain of attribute a, i.e. the set of values of the attribute a; \(\varrho \) is a total function, \(\varrho : X \times A \rightarrow V\), wherein \(\varrho (x,a) \in V_a\) for each \(x \in X\) and \(a \in A\).

The domain \(V_a\) of attribute a in the IS is the set \(V_a\) described as follows:

$$\begin{aligned} V_a = \{v \in V{:}~{ there~exists}~x \in X~{ such~that}~ \varrho (x,a)=v\} \end{aligned}$$
(2)

3.1.1 Practical preliminaries

The name of the IS (films, cars, etc.) is usually a general concept that aggregates a combination of attributes and their values. For example, the name movies is a concept that may contain attributes like the film title, actor’s name and production year.

In the remainder of this article, the author uses a shortened notation to describe the IS, i.e. we can treat the IS as an n-tuple of attributes and their values: \({ IS}{<}X = \{x_1,\ldots , x_{|X|}\},~\textit{attribute-name-1} = \{\textit{value 1 of attribute 1}, \textit{value 2 of attribute 1},\ldots \},\ldots , a \in A = V_a{>}\).

In the rest of this article, the following types of tuples will be used: the monad (singleton) and the n-tuple. A monad tuple is an IS having one attribute (\(|A| = 1\)), written as \({ IS}{<}a \in A = V_a{>}\), \(\textit{is-a}(V_a, a \in A)\), or \([a \in A](V_a)\). For example, IS<film title = {die hard, x files,...}>, IS<actor name = {bruce willis, david duchovny, ...}> and is-a(die hard, film title) or film title(die hard). An n-tuple is an IS with n attributes (\(|A| > 1\)).

Finally, while describing the IS, it is worth noting that the attributes can be granulated. The attribute values can be generalized or more closely specified, so that attribute taxonomies can be built. Assume the attribute \(a \in A\) and the set of its values \(V_a\). This attribute can be decomposed into \(a_1\) and \(a_2\) such that \(V_{a_1} \cap V_{a_2} = \emptyset \). Then, it is possible to connect the attributes’ values back: \(V_{a_1} \cup V_{a_2} = V_a\). The first action is defined as specification of attribute values. The second action is defined as generalization of attribute values. For example, the attribute film title can be split into two separate attributes film-title-pl (Polish film title) and film-title-en (English film title), which can then be re-connected to obtain the film title set of attribute values. Of course, it is not a general rule that \(V_{a_1} \cap V_{a_2} = \emptyset \); for example, the attribute person name can be split into the musician name and actor name attributes, and it is obvious that there are actors who are musicians and vice versa, for example Will Smith. However, the assumption that \(V_{a_1} \cap V_{a_2} = \emptyset \) holds for the input data set has a positive effect on the results obtained from the proposed semi-supervised method of IE.
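To make the shortened notation concrete, the following Python sketch (illustrative only; the helper names specialize and generalize and the extra example value seksmisja are hypothetical and not part of the original system) represents a singleton IS as a mapping from an attribute name to a set of values and shows the specification/generalization of attribute values:

  # A singleton IS represented as a mapping: attribute name -> set of values.
  film_title = {"film title": {"die hard", "x files"}}
  actor_name = {"actor name": {"bruce willis", "david duchovny"}}

  def specialize(values, split_rule):
      """Specification: split one attribute's value set into disjoint subsets."""
      parts = {}
      for v in values:
          parts.setdefault(split_rule(v), set()).add(v)
      return parts

  def generalize(*value_sets):
      """Generalization: re-connect the value sets of several attributes."""
      merged = set()
      for vs in value_sets:
          merged |= vs
      return merged

  # Split "film title" into film-title-pl / film-title-en by a toy rule,
  # then merge the parts back into the original value set.
  titles = {"die hard", "x files", "seksmisja"}
  split = specialize(titles, lambda t: "film-title-pl" if t == "seksmisja" else "film-title-en")
  assert generalize(*split.values()) == titles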

3.2 The general framework of the information extraction system

The author considered the whole process of IE. This section describes all the components of an IES, regardless of the analysed data structure. It was assumed that an HTML document can be treated as a structure that stores free text (<p>free text</p>) or provides hidden semantic information. The HTML tag layouts (in short, HTML layouts) define this information (<h1 style=“actor name”>Bruce Willis</h1>). In the first case, natural language processing (NLP) algorithms are used to process the free text. These tools are used to locate sentence boundaries, to grammatically analyse the sentences, etc. In the second case, the WI algorithms are used to analyse the structure of the HTML tag layouts. The created wrappers extract relevant information from these layouts. Figure 1 shows the basic components of the IES.

Fig. 1 The information extraction system pipeline

Figure 1 shows the IES pipeline. We can consider the task of IE as a kind of reverse engineering. Usually, we try to restore a full or partial model of an IS based on free text or HTML tag layouts. We can divide this task into two subtasks. The first subtask relates to the creation of an information system schema (defining the IS attributes and the possible relationships between them), while the second subtask relates to the extraction of the attribute values of the created IS schema. The IS schema that contains the attributes and the values assigned to them is, in short, called the IS.

The process presented in Fig. 1 gets an input data set, which is a collection of documents (corpus) P. It contains HTML documents \(p \in P\), which belong to a domain d. The domain comes from the set of domains D (\(d \in D\)). An unknown domain process \({ process}_{d}\) has created these documents. The process connects information from an unknown \({ IS}_d\) with the HTML layout \(L_{d}\) and noise \(T_{d}\). The HTML layout defines a presentation layer. The presentation layer displays information from the hidden \({ IS}_d\). Noise is created by the dynamic and random elements generated in the presentation layer. For example, the domain of movies can contain a simple IS, which consists of the following attributes and their values: \({ IS}_{d}{<}\)film title = {x files}, actors = {david duchovny, gillian anderson}, comments = {comments body 1, comments body 2}>. The HTML layout might look like <h1 class=“film title”>$film title</h1><br/>Actors<ul><li>$actors\(_{1}\)</li><li>$actors\(_{2}\)</li></ul>...<div class=“todays comments”>random(comments)</div>. Noise, in this case, is generated by the random() function, which returns a random comment. An output document from this process will have the following form: <h1 class=“film title”>x files</h1><br/>Actors<ul><li>david duchovny</li><li>gillian anderson</li></ul>...<div class=“todays comments”>comments body 2</div>. It should be noted that the same information can be expressed using free text embedded in an HTML layout: <p>The film’s title is “The X-Files”. In this film, the starring actors are david duchovny and gillian anderson. A randomly selected comment at the premiere, which I heard, was as follows: comments body 2.</p>.
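For illustration, a minimal Python sketch of such a domain process follows (the template string, the function name process_d, and the use of random.choice as the noise source are hypothetical simplifications of the example above, not the actual generator of any real website):

  import random

  # The hidden domain process: page = layout(IS_d values) + noise T_d.
  IS_d = {
      "film title": ["x files"],
      "actors": ["david duchovny", "gillian anderson"],
      "comments": ["comments body 1", "comments body 2"],
  }

  LAYOUT_d = ('<h1 class="film title">{title}</h1><br/>Actors'
              '<ul><li>{actor1}</li><li>{actor2}</li></ul>'
              '<div class="todays comments">{comment}</div>')

  def process_d(info_sys):
      """Connect the hidden IS with the HTML layout and add noise (a random comment)."""
      return LAYOUT_d.format(
          title=info_sys["film title"][0],
          actor1=info_sys["actors"][0],
          actor2=info_sys["actors"][1],
          comment=random.choice(info_sys["comments"]),  # the noise component T_d
      )

  print(process_d(IS_d))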

In Fig. 1, the Induction of the information system schema component creates the IS schema. The schema contains attribute names and the relationships between them. A software engineer who creates the IS can induce this schema based on manual analysis. The software engineer may also use other IES components to induce the attribute names and the relationships between them. Thanks to these components, we may extract the attribute names and values as well as the relationships between attributes.

In Fig. 1, the Identification of parts of the content of web pages to the information extraction component is used to mark important sequences of HTML tags or text in a document. The WI algorithm creates patterns based on these markings. We may mark documents using supervised, semi-supervised, or unsupervised methods. Alternatively, while viewing documents, we may manually identify important parts of the documents and immediately create the patterns (brute-force method) or directly save the information to the IS. Thanks to this, we can omit the WI component and go directly to the Pattern matching or Save the matched information to information system component. The supervised method is also based on manual marking. In this method, we also mark important parts of documents; however, we do not create the patterns ourselves and we do not omit the WI component. The semi-supervised way involves the creation of an input set of seeds (the attribute values of the IS). This set does not necessarily depend on marked documents. However, it must contain seeds that can be matched to the analysed documents, which is the necessary condition for the WI. In the unsupervised identification, an algorithm identifies the important elements of the document on its own. After marking the whole or some parts of the documents, we can go to the Wrappers induction component.

In Fig. 1, the Wrappers induction component is used to create the patterns. Based on the sequences of documents marked by the Identification of the parts of the content of web pages to the information extraction component, the WI algorithm creates the patterns. Next, the created patterns are saved in a data buffer (a memory, a database, etc.).

In Fig. 1, the Pattern matching component is used to match the created patterns. This component takes the patterns from the data buffer. After that, a matching algorithm matches these patterns to other documents. In this way, the component extracts new attribute values. The Save the matched information to information system component saves these new attribute values into the IS. Depending on the type of analysis, i.e. induction of the attribute names or relation names of the IS, or extraction of the attribute values, the information can be stored in an auxiliary or destination IS, which has the established IS schema. Of course, we may perform a manual identification of important information while viewing documents, and save this information directly to the selected IS (the Wrappers induction and the Pattern matching components are then skipped).

In Fig. 1, the Verification component is optional. We may use it to validate the extracted attribute values, attribute names, or relation names. Such verification of facts may be based on external data sources, e.g. an external corpus of documents [8, 16].

In Fig. 1, the boosting phase is an optional element. Extracted information (verified or not), depending on the type of analysis, can be redirected to the Induction of the information system schema or the Identification of the parts of the content of web pages to the information extraction component.

In Fig. 1, the last Evaluation component is optional. We may use this component to verify the entire process or its individual components. For example, we may evaluate the WI algorithm or the component that verifies facts collected in the IS, etc.

4 BigGrams as the implementation of the information extraction system

Section 4.1 describes the comparison of the BigGrams and the SEAL systems. Section 4.2 explains the specification of the proposed IES. Section 4.3 describes the algorithm and its use case.

4.1 The comparison of BigGrams and SEAL systems

The BigGrams system is able to extract unary and binary relations, and it is partially similar to the SEAL. The main differences between the BigGrams and the SEAL are as follows. The data structure used in the BigGrams is a lattice instead of the Trie data structure utilized in the SEAL. Also, the BigGrams system uses a different WI algorithm. The term lattice comes from FCA, which is also applied in many other disciplines, such as psychology, sociology, anthropology, medicine, biology, linguistics, mathematics, etc. [77]. The author did not find in the available literature approaches that use the mathematical model of FCA to build a WI algorithm for HTML documents. Another difference concerns the method of document analysis. In the SEAL, the WI algorithm creates the patterns at the level of a single page and at the chars level. In the BigGrams, the WI algorithm treats the whole domain of documents as one huge document. Based on this document, the BigGrams creates a set of patterns and attempts to extract all important and relevant instances from this domain (high precision and recall inside the Internet domain). In contrast, the SEAL retrieves instances by using every single HTML document from the whole Internet and attempts to achieve high precision. The SEAL also uses a rank function to rank the extracted instances. This function filters out, for example, the noise instances. The BigGrams does not use any ranking function. Furthermore, from the point of view of the extraction task (the extraction of relevant instances from domains), the SEAL is not the appropriate tool to accomplish this task because of its low recall.

Like in the SEAL, the WI algorithm of the BigGrams can use a sequence of characters (raw chars, chars level, raw strings, strings level, or strings granularity) [47]. This algorithm extracts these strings from the HTML document and uses them to create patterns. However, the author noticed that it is better to change these raw strings to the HTML tags level (HTML tags granularity). This article presents the results of this change.

In contrast to the SEAL, the WI algorithm of the BigGrams can use a more complex input structure for the WI. The BigGrams may use a taxonomy of seeds rather than a simple set of seeds (a bag of seeds). The bag of seeds contains input instances (seeds) without semantic diversity. The taxonomy includes this semantic diversity.

Furthermore, the author introduced a weak assumption that, based on the input values of an IS, the created patterns will extract new values belonging to the attributes of that IS. This is a weak assumption because a created pattern can extract a value belonging to another IS. It occurs when, based on values from \({ IS}_1\) and values from \({ IS}_2\) (disjoint value sets), the WI algorithm creates the same pattern that covers values from both \({ IS}_1\) and \({ IS}_2\). In the algorithm output, it cannot be recognized which values belong to which IS. Despite this drawback, this approach significantly improves the performance of the proposed WI algorithm, which has been shown experimentally and is described in this article.

Finally, it is worth mentioning that the BigGrams, like the SEAL, does not operate on a “live DOM” where all CSS (e.g. CSS boxes) and JavaScript are applied to a page. Moreover, the dynamic elements of HTML 5 are omitted in the WI phase. The BigGrams processes only the statically rendered web pages.

4.2 Specification on high level of abstraction

The aim of the BigGrams system, for a given particular domain, is to extract only information (new seeds, new values of attributes) connected with this domain. For example, for a domain connected with movies, the BigGrams should extract the actors’ names, film titles, etc. To this end, the BigGrams analyses all the pages \(p\in P\) from the domain \(d\in D\). Based on this analysis, the BigGrams creates patterns for the whole domain and extracts new seeds. Figure 2 shows the general scheme of the BigGrams system.

Fig. 2 The general scheme of the BigGrams system

The input of the BigGrams system accepts two data sets (Fig. 2). The first data set may include a set of seeds (a bag of seeds) or a taxonomy of seeds. The second data set contains the domain’s web pages (three examples of web pages are shown in Figs. 4, 5, and 6). The bag of seeds contains input instances (seeds); in other words, all values (instances) are assigned to one attribute (thing name) of a singleton IS. We may split this attribute into several attributes that store the values that are more relevant to them (from a semantic point of view), i.e. we can create a taxonomy of seeds. The values are then assigned to semantically appropriate attribute names. Let us consider the bag of seeds in the following form: IS<thing name = {district 9, the x files, die hard, bruce willis}>. The thing name attribute could be split into more specific attributes, such as english film title and actor name. In this way, we can create separable input data sets (singleton ISs). Each IS contains specific attributes that capture well the meaning (semantics) of their values. For the bag of seeds mentioned above, we can create separable ISs such as IS<english film title = {district 9, the x files, die hard}> and IS<actor name = {bruce willis}>. The output of the BigGrams system contains an extended input set. The system realizes the following steps: (1) it creates patterns based on the input pages and seeds, (2) it extracts new instances from these pages by using the patterns, and (3) it adds the new instances to the appropriate data set. Furthermore, the system may work in the batch mode or the boosting mode. The batch mode does not operate in an iterative way and does not use the output results (the newly extracted seeds) to extend the input seeds. The boosting mode includes these abilities.

4.2.1 Specification details with examples

Figure 3 presents the more elaborate scheme of the BigGrams system.

Fig. 3 The BigGrams’ pipeline

Fig. 4 Contents of document \(p_1\)

Fig. 5 Contents of document \(p_2\)

Fig. 6 Contents of document \(p_3\)

The process presented in Fig. 3 shows the basic steps of the BigGrams system. Firstly, a set of IS schemas has to be created manually. Each IS schema consists of common attributes, such as id-domain (a domain identifier) and id-webpage (a web page identifier). Also, each IS schema contains an individual attribute (not shared between IS schemas), such as film-title-pl, film-title-en, or actor name. After the pattern matching phase, the acquired information is saved in the particular IS schemas.

In the second step, a collection of documents for each domain (Domain’s web pages) is gathered from a distributed database (DDB). After this step, an initial document processing (Pre-processing) takes place. This processing:

  • fixes the structure of an HTML document (closes the HTML tags, closes the attribute values with the missing quotation marks, etc.),

  • cleans an HTML document from unnecessary elements (header, JavaScript, css, comments, footers, etc.),

  • changes the level of granularity of HTML tags.

The HTML tags contain attributes and their values, for example <h1 attribute1 = “value1” attribute2 = “value2”>. We may change the granularity of HTML tags by removing the values of the attributes or by removing the attributes together with their values. For example, we can express the above-mentioned HTML tag as (a minimal pre-processing sketch follows the list below):

  • <h1 attribute1 = “” attribute2 = “”> - the HTML tag without attribute values,

  • <h1> - the HTML tag without attributes and their values.
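A minimal Python sketch of the two granularity reductions (the regular expressions are chosen for illustration; the actual pre-processing in BigGrams may differ):

  import re

  def drop_attribute_values(html):
      """HTML tags without attribute values: <h1 a="x"> -> <h1 a="">."""
      return re.sub(r'(\w+\s*=\s*)"[^"]*"', r'\1""', html)

  def drop_attributes(html):
      """HTML tags without attributes and their values: <h1 a="x"> -> <h1>."""
      return re.sub(r'<(/?\w+)[^>]*>', r'<\1>', html)

  tag = '<h1 attribute1="value1" attribute2="value2">'
  print(drop_attribute_values(tag))  # <h1 attribute1="" attribute2="">
  print(drop_attributes(tag))        # <h1>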

The semi-supervised identification performs matching of the seeds (the attribute values of a singleton IS) against each processed HTML document. The input set of seeds for each created schema of a singleton IS is created manually. After matching a seed to the document, the n left and m right HTML tags that surround the matched seed are collected. Based on these, we can create a set of data triples \(t_{d}\) = <n left HTML tags, matched seed, m right HTML tags>. Based on this set, the algorithm creates one global big document. This document can be considered as a collection of the aforementioned data triples \(t_{d}\).
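The construction of the data triples can be sketched as follows (a simplified tokenization; the real system additionally pre-processes the documents and works with the taxonomy of seeds, and the function names here are illustrative):

  import re

  def html_tokens(html):
      """Split a (pre-processed) HTML document into tag tokens and text tokens."""
      return [t.strip() for t in re.split(r'(<[^>]+>)', html) if t.strip()]

  def data_triples(html, seeds, n_left, m_right):
      """Build <n left HTML tags, matched seed, m right HTML tags> triples."""
      tokens = html_tokens(html)
      triples = []
      for i, tok in enumerate(tokens):
          if tok.lower() in seeds:
              left = [t for t in tokens[:i] if t.startswith('<')][-n_left:]
              right = [t for t in tokens[i + 1:] if t.startswith('<')][:m_right]
              triples.append((left, tok, right))
      return triples

  page = ('<ul><li class="film title"><br/>the x files</li><p>1998</p>'
          '<li class="film title"><br/>die hard</li><p>1988</p></ul>')
  print(data_triples(page, {"the x files", "die hard"}, n_left=3, m_right=2))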

The Wrappers induction step contains an embedded element called Pre-processing of wrappers induction. This element processes the HTML documents before performing the wrapper induction. The component is responsible for outlier seed detection and filtration. It detects and removes the outlier data triples \(t_{d}\) from the big document. The previous experiments [47] have shown that the data set can contain triples \(t_{d}\) whose seeds contribute to the induction of patterns in a negative way. Usually, these are seeds and \(t_{d}\) that occur in the domain’s HTML documents too often or too rarely. This component removes the too frequent or too rare seeds from the big document, i.e. seeds with a frequency below \(Q1 - 1.5 \cdot { IQR}\) or above \(Q3 + 1.5 \cdot { IQR}\) (Q1 is the first quartile, Q3 is the third quartile, and \({ IQR}\) is the interquartile range). Also, the approach of sampling only k random triples \(t_{d}\) from the big document is used if the document is too long, i.e. when it contains more than p data triples \(t_{d}\). After the outlier seed detection and filtration, the WI algorithm creates the patterns based on the one global big document. The author defines a pattern as a pair which contains the left l and the right r contextual HTML tags.
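A sketch of the outlier filtration step (the quartile computation from Python’s statistics module and the sampling threshold are illustrative choices of this sketch, not the exact implementation):

  import random
  from collections import Counter
  from statistics import quantiles

  def filter_outlier_triples(triples, k_sample=1000):
      """Drop triples whose seed frequency lies outside the interquartile fences,
      then keep at most k_sample random triples if the big document is too long."""
      freq = Counter(seed for _, seed, _ in triples)
      if len(freq) >= 4:  # quartiles need a handful of distinct seeds
          q1, _, q3 = quantiles(freq.values(), n=4)
          iqr = q3 - q1
          low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
          triples = [t for t in triples if low <= freq[t[1]] <= high]
      if len(triples) > k_sample:
          triples = random.sample(triples, k_sample)
      return triples

  # usage: big_document = filter_outlier_triples(big_document)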

The tags that surround the matched seeds in the data triples are defined as a left extension or a right extension. The left and right extensions have fixed lengths, expressed as numbers of HTML tokens. For example, we may assume as input the IS singleton IS<film-title-en = {the x files, die hard, the avengers, district 9}>. The input seed set \(V_{a=\textit{film-title-en}}\) consists of four seeds (values of the attribute film-title-en), i.e. \(|V_{a=\textit{film-title-en}}| = 4\) and \(V_{a=\textit{film-title-en}}=\{\textit{the x files}, \textit{die hard}, \textit{the avengers}, \textit{district 9}\}\). Also, it is assumed that the domain \(d\in D\) is represented by a set P of three documents, \(P=\{p_1, p_2, p_3\}\). The contents of the three pages are shown in Figs. 4, 5, and 6, respectively.

We can match the seeds \(s\in V_{a=\textit{film-title-en}}\) to the documents \(p_1, p_2\), and \(p_3\), and in this way we can retrieve the set of data triples \(t_{d}\). Next, we create a big document that connects all the data triples \(t_{d}\). The HTML tokens of the data triples have the fixed left \(k_l\) and right \(k_r\) lengths. Figure 7 presents the big document for the lengths \(k_l=3\) and \(k_r=2\).

Fig. 7 Contents of document \(p_4\)

After creation of the big document, each seed occurrence from the data triples \(t_{d}\) obtains its unique id \(o_i\), \(i=1,\ldots , y\), where y is the count of all triples \(t_{d}\) (\(\{o_1 = \textit{the x files}, o_2 = \textit{die hard}, o_3 = \textit{the avengers}, o_4 = \textit{the x files}, o_5 = \textit{district 9}, o_6 = \textit{die hard}, o_7 = \textit{the avengers}\}\)). In addition, each object \(o_i\) is associated with the identifiers (indexes) of web pages. Thanks to this, we know which page contains a specific object. The WI algorithm creates extraction patterns for the domain based on the big document created in this way and with the use of FCA (Sect. 4.3).

In Fig. 3, it can be noticed that the Wrappers induction phase is followed by the Pattern matching phase and the Update/Save phase. In the Pattern matching phase, the patterns are subsequently used to extract new instances. In this phase, the instances between the left and right HTML tokens are extracted. Based on the previously considered example, the WI algorithm may create a general pattern, like <p class=“film title”><br/>(.+?)<br/></p>. After applying this pattern to the documents, two new instances of the attribute film title will be extracted: nine months and dead poets society. Thus, the initial input set of seeds is extended by two new instances of the attribute film title. In the Update/Save phase, these new values of the attribute are saved into the singleton IS or the appropriate input data set is updated. Furthermore, we can use the boosting mode to improve the output collections of instances. The received output instances can be directed back to the input of the semi-supervised identification phase.
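Applying such a pattern boils down to a non-greedy regular-expression match over the pre-processed documents; a small sketch using the pattern from the example above (the document strings are illustrative):

  import re

  pattern = re.compile(r'<p class="film title"><br/>(.+?)<br/></p>')

  documents = [
      '<p class="film title"><br/>nine months<br/></p>',
      '<p class="film title"><br/>dead poets society<br/></p>',
      '<div>no film title here</div>',
  ]

  new_instances = set()
  for doc in documents:
      new_instances.update(pattern.findall(doc))  # text between the left and right tokens

  print(new_instances)  # {'nine months', 'dead poets society'}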

4.3 Implementation

This section describes the FCA theory (Sect. 4.3.1), which is the core of the proposed WI algorithm. Moreover, the algorithm and its use case are described in Sect. 4.3.2.

4.3.1 Theoretical preliminaries

Rudolf Wille introduced FCA in 1984. FCA is based on partial order theory, which Birkhoff developed in the 1930s [54, 77]. FCA serves, among others, to build a mathematical notion of a concept and provides a formal tool for data analysis and knowledge representation. Researchers use a concept lattice, usually drawn as a Hasse diagram, to visualize the relations among the discovered concepts. This diagram consists of nodes and edges. Each node represents a concept, and each edge represents a generalization/specialization relation. FCA is one of the methods used in knowledge engineering. Researchers use FCA to discover and build ontologies (for example, from textual data) that are specific to particular domains [14, 44].

FCA consists of three steps: defining the objects O, attributes C, and incidence relations R; defining a formal context K in terms of an attribute, object, and incidence relation; and defining a formal concept for a given formal context. The formal context K is a triple [27]:

$$\begin{aligned} K = {<}O, C, R{>} \end{aligned}$$
(3)

where O is the non-empty set of objects; C is the non-empty set of attributes; R is the binary relation between objects and attributes; \(oRc\) denotes the fact that an object o has an attribute c.

From the formal context K the following dependencies can be derived: any subset of objects \(A\subseteq O\) generates a set of attributes \(A'\) that can be assigned to all objects from A, e.g. \(A = \{o2, o3\} \rightarrow A' = \{c2, c3\}\) and any subset of attributes \(B\subseteq C\) generates a set of objects \(B'\) that have all attributes from B, e.g. \(B = \{c2\} \rightarrow B' = \{o2, o3\}\).

The formal concept of the context K(O, C, R) is a pair (A, B), where [27]: \(A = B'= \{ o \in O : \forall c \in B \, oRc \}\) is the extension of (A, B) and \(B = A' = \{ c \in C : \forall o \in A \, oRc \}\) is the intension of (A, B).

With each concept there is a related extension and intension. The extension is the class of objects described by the concept. The intension is the set of attributes (properties) that are common to all objects from the extension. The concepts \((A_1, B_1)\) and \((A_2, B_2)\) of the context K(O, C, R) are ordered by the relation defined as follows [27]:

$$\begin{aligned} (A_1, B_1) \le (A_2, B_2) \iff (A_1 \subseteq A_2 \iff B_2 \subseteq B_1) \end{aligned}$$
(4)

The set S(K) of all concepts of the context K together with the relation \(\le \), i.e. \((S(K), \le )\), constitutes a lattice called the concept lattice of the formal context K(O, C, R) [27].
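For a tiny context, the derivation operators and the set of all formal concepts can be computed by brute force. The following Python sketch uses a context chosen only to be consistent with the example derivations above (the objects o1-o3 and attributes c1-c3 are illustrative assumptions, not data from the paper):

  from itertools import chain, combinations

  # A tiny formal context K<O, C, R>, consistent with the example:
  # A = {o2, o3} -> A' = {c2, c3} and B = {c2} -> B' = {o2, o3}.
  O = {"o1", "o2", "o3"}
  C = {"c1", "c2", "c3"}
  R = {("o1", "c1"), ("o2", "c2"), ("o2", "c3"), ("o3", "c2"), ("o3", "c3")}

  def intent(A):  # A': attributes shared by all objects in A
      return {c for c in C if all((o, c) in R for o in A)}

  def extent(B):  # B': objects having all attributes in B
      return {o for o in O if all((o, c) in R for c in B)}

  def concepts():
      """Enumerate all formal concepts (A, B) with A = B' and B = A'."""
      subsets = chain.from_iterable(combinations(sorted(C), r) for r in range(len(C) + 1))
      return {(frozenset(extent(set(B))), frozenset(intent(extent(set(B))))) for B in subsets}

  print(intent({"o2", "o3"}))  # {'c2', 'c3'}
  print(extent({"c2"}))        # {'o2', 'o3'}
  print(len(concepts()))       # the number of nodes in the concept lattice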

4.3.2 The wrapper induction algorithm and the use case

The algorithm presented below has three properties. Firstly, it suffices to scan the set of input pages and the set of seeds only once to construct the patterns. Secondly, the patterns are constructed with the use of the concept lattice described in Sect. 4.3.1. The pattern construction consists of finding combinations of left l and right r HTML tokens surrounding the matched seeds that make it possible to extract new candidates for seeds. Thirdly, the algorithm has parameters to control its performance, e.g. precision, recall, and F-measure. One such parameter is the minimum length of the pattern, which is defined by the minimum number of left l and right r HTML tokens that surround the seed.

Now, it will be described how the left and right lattices are constructed based on the big document (Sect. 4.2.1). An appropriate relation matrix is constructed first, which then serves for building the left and right concept lattices. The matrix for building the left (prefix) lattice is shown in Table 1. The resulting lattice is shown in Fig. 8. The right (suffix) matrix and lattice are built analogously.

Table 1 An example of the cross-table

Table 1 shows a matrix of the incidence relation between objects (indexed by seeds in the big document) and the HTML tokens that surround them from the left (which may be viewed as FCA attributes). In the matrix, there are seven objects and five attributes. The considered HTML tokens are restricted by the maximum number of HTML tokens. The string of HTML tokens can be expanded (starting from the right and moving to the left) and is represented by 5 attributes: <br/>, <li class=“film title”><br/>, <ul><li class=“film title”><br/>, <p class=“film title”><br/> and </li><p class=“film title”><br/>. The relation between an object and an attribute is present only if it is possible to match the given object with the given attribute (which is represented by “1” inside the matrix, Table 1). In this way, it is possible to derive the appropriate left lattice from the relation matrix, which is illustrated in Fig. 8. The lattice follows the partial order described by Eq. 4 in Sect. 4.3.1. We can see two concepts \(k_1\) and \(k_2\) (not counting the top and bottom nodes). The split was done due to the attribute <br/>, i.e. by extending the pattern <br/>, respectively, by the <p class=“film title”> or <li class=“film title”> HTML token. In this way, two new separate concepts are created.

Fig. 8 The concept lattice for data from Table 1

It can be noticed that the first concept \(k_1\) aggregates the information about the objects \(\{o_1, o_3, o_4, o_6, o_7\}\), surrounded from the left by such prefixes as <li class=“film title”><br/> and <ul><li class=“film title”><br/>, etc. The objects \(o_i \in k_1\) are surrounded by HTML token expansions of lengths \({ conceptLength}(k_1) = \{2, 3\}\). Additional information indirectly encoded inside the concepts concerns the distribution of seeds among pages, which will be further used by the algorithm.

With the left and right concept lattices as an input, the pattern construction algorithm can be initiated. The pseudo-code of the algorithm is depicted in Fig. 9.

Fig. 9 The pseudo-code of the proposed algorithm to wrapper induction

The algorithm from Fig. 9 creates the extraction patterns. Next, the patterns are used to extract new seeds. The instances are retrieved from documents belonging to the domain for which the patterns were created. The algorithm proceeds in two phases. The first phase consists of the execution of the function receiveLeftLatticeConcept() (Fig. 9: line 1, the body of the function is presented in Fig. 10). The second phase consists of the execution of the function receiveWrappers() (Fig. 9: line 2, the body of the function is presented in Fig. 12).

Fig. 10 The pseudo-code of the \({ receiveLeftLatticeConcept}\) function

The function shown in Fig. 10 retrieves the candidates for the left patterns from the left lattice of concepts. We retrieve the patterns whose attributes (the left expansions) have a length equal to or greater than the input parameter minNumberOfLeftHtmlTags. The inner function conceptLength (Fig. 10, line 6) is responsible for calculating the number of HTML tokens. The function from Fig. 10 also selects those concepts from the left lattice that achieve a support value higher than another parameter, supportConcept (Fig. 10, line 8). This value is computed by the function supportConcept(). Figure 11 presents the pseudo-code of this function.

Fig. 11 The pseudo-code of the supportConcept function

The supportConcept() function from Fig. 11 computes the support. The support is the ratio of the number of pages (identifiers) aggregated by a given concept to the number of pages in the domain covered by the concepts (the number of documents from the supremum of the lattice). The inner function pagesCoverByConcept from Fig. 11 (line 2) retrieves the set of identifiers of pages aggregated by a given concept \(k_i\).
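A sketch of this support computation (here a concept is represented simply by the set of page identifiers it aggregates; the names follow the pseudo-code, but the implementation details are assumptions of this sketch):

  def support_concept(pages_covered_by_concept, count_pages_per_domain):
      """Ratio of pages aggregated by the concept to the pages covered in the domain."""
      return len(pages_covered_by_concept) / count_pages_per_domain

  # For the worked example below: k1 covers pages {1, 2, 3} out of 3 domain pages.
  print(support_concept({1, 2, 3}, 3))  # 1.0
  print(support_concept({1, 2}, 3))     # 0.666...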

After the first phase is completed, the second phase is initiated. The second phase of the algorithm executes the function receiveWrappers() (Fig. 9: line 2, the body of the function is presented in Fig. 12).

Fig. 12 The pseudo-code of the \({ receiveWrappers}()\) function

The function from Fig. 12 is responsible for retrieving the extraction patterns. During its execution, the left concepts from the left lattice and the right concepts from the right lattice are compared. The right concepts whose right expansions are not shorter than the value of the input parameter minNumberOfRightHtmlTags are selected. Next, if this condition is satisfied, the value (line 9) that estimates the support between the current left concept \(k_{\textit{left-i}}\) and right concept \(k_{\textit{right-i}}\) is computed. Its computation consists of checking what percentage of pages is covered by both the left and the right concepts. The pattern is accepted only if the computed value is not lower than supportInterConcept. The pattern consists of the left and right expansions retrieved from the left and right concepts \(k_{\textit{left-i}}\) and \(k_{\textit{right-i}}\).
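The pairing of left and right concepts can be sketched as follows; this only approximates the logic described for Fig. 12 (the overlap measure and the representation of a concept as a dict are assumptions of this sketch, not the author’s exact pseudo-code):

  def receive_wrappers(left_concepts, right_concepts, min_right_tags, support_inter_concept):
      """Pair left and right concepts into patterns (left tokens, right tokens).
      Each concept is a dict: {'tokens': [...], 'pages': set of page ids}."""
      wrappers = []
      for k_left in left_concepts:
          for k_right in right_concepts:
              if len(k_right['tokens']) < min_right_tags:
                  continue
              union = k_left['pages'] | k_right['pages']
              overlap = len(k_left['pages'] & k_right['pages']) / len(union)
              if overlap >= support_inter_concept:
                  wrappers.append((k_left['tokens'], k_right['tokens']))
      return wrappers

  left = [{'tokens': ['<ul>', '<li class="film title">', '<br/>'], 'pages': {1, 2, 3}}]
  right = [{'tokens': ['</li>', '<p>'], 'pages': {1, 2, 3}}]
  print(receive_wrappers(left, right, min_right_tags=2, support_inter_concept=0.55))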

Below, an illustration of the algorithm execution for the data previously considered in Fig. 7 is presented. The following settings are assumed: \(\textit{minNumberOfLeftHtmlTags} = 3\), \(\textit{minNumberOfRightHtmlTags} = 2\), \(\textit{supportConcept} = 0.1\), \(\textit{supportInterConcept} = 0.55\) and \(\textit{countPagesPerDomain} = |P| = 3\). The first phase of the algorithm will return the following set of left concepts \(K_{{ left}} = \{k_1, k_2\}\), where \(k_1 = \{o_1, o_3, o_4, o_6, o_7\}\) and \(k_2 = \{o_2, o_5\}\). These objects cover the following documents: \({ pagesCoverByConcept}(k_1) = \{1, 2, 3\}\) and \({ pagesCoverByConcept}(k_2) = \{1, 2\}\). These concepts satisfy the conditions \({ conceptLength}(k_{i}) \ge { minNumberOfLeftHtmlTags}\) and \({ supportConcept}(k_{i}, { countPagesPerDomain}) > { supportConcept}\).

The second phase of the algorithm returns the following set of patterns: \(W_{out}\) = {<ul><li class=“film title”><br/>(.+?)</li><p>, </li><p class=“film title”><br/>(.+?)<br/></p>}. After matching these patterns to the documents \(p_1, p_2\), and \(p_3\), the following new seeds will be extracted: nine months, dead poets society, good will hunting, the departed, dead poets society.

5 Empirical evaluation of the solution

Section 5.1 describes the reference data set (the relevant set) used to evaluate the WI algorithm. The evaluation was based on the indicators described in Sect. 5.2. Section 5.3 explains the experiment plan. Section 5.4 describes its realization and the results. Furthermore, an additional benchmark, which is based on another data set and compares the proposed approach with another IE approach (the supervised method proposed by Hao et al. [28]), is presented in “Appendix A”.

5.1 The description of the reference data set

The author has not found an adequate reference data set to evaluate the proposed algorithm. In the literature, there are many references to data sets [19, 51, 75]. Unfortunately, these data sets are not suitable for evaluating the proposed method. Well-labelled web documents from a certain domain are required, and each document from the domain must be labelled with a set of important instances (keywords). For this reason, the author created his own labelled reference data set. This data set contains 200 well-diversified documents for each of the Internet domains filmweb.pl, ptaki.info, and agatameble.pl. In the rest of this section, the author uses the term “domain” rather than Internet domain.

The test collection of HTML documents obtained (collected/crawled) from the filmweb.pl domain includes 200 documents. Among the 200 documents, 156 documents include information that should be extracted. The rest of the documents contain information irrelevant to the IE task, but they were not excluded. These documents emulate noise. The WI should not create patterns for these pages, and no instances should be extracted from them. The author created the reference data set based on the 156 relevant documents. This set includes \(V_{{ ref}} = 4185\) instances of different types, i.e. different semantic classes (a film title, an actor name, etc.). The author also used 200 HTML documents from each of the two other domains, which have less complex layouts, to evaluate the proposed WI algorithm. These domains are (1) ptaki.info (about birds) and (2) agatameble.pl (a trade offer store). The reference data sets for these domains include \(V_{{ ref}}=142\) and \(V_{{ ref}}=1264\) instances, respectively.

5.1.1 Practical preliminaries

The creation of a good reference collection of seeds is a difficult, demanding and time-consuming task. This phase involves many problems, such as interpretation of the extracted (matched) information and adding it to the created reference data set. This is an important step, because based on these reference data, the effects of the proposed WI algorithm will be evaluated. The author decomposed the fundamental problem of information interpretation from the HTML document templates into several smaller sub-problems. This problem is quite general and can occur during analysis of most websites, which can be characterized by complicated HTML templates. In particular, the point is that the same information can be presented differently. The following brief analysis of potential problems was conducted based on observation of the IE from the filmweb.pl domain. This domain includes dozens of different templates. The author assumes (based on empirical observations) that the rich structure of this domain and its analysis is a good approximation to the analysis of the other domains’ content. The described domain contains the HTML templates that display information about:

  • a single movie/series/video game/theatre play. The templates print the information from the following n-tuple <polish title, english title, list of actors names, list of actors roles, music, photos>. In addition, the author detected in the test set templates that present a short and a full version of the above n-tuple, e.g. the list of actors names can present all values (a full version) or the k first values (a short version).

  • a set of movies. The templates print the information from the following n-tuples <polish and english films titles>,

  • a set of movies. The templates print the information from the following n-tuples <polish films titles, english films titles>. Furthermore, in the test set, the author detected two different templates that represent the above-mentioned tuples,

  • a single actor. The templates print the information from the following n-tuples <actor’s name and surname, polish films titles, english films titles, names of films roles>,

  • the user’s favourite films. The templates print the information from the following n-tuples <prefix with the film production year and english films titles, polish films titles>. Furthermore, in the test set, the author detected two different templates that represent the above-mentioned n-tuple.

The author faced one fundamental problem during the analysis of the website templates and the extraction of information from them into the reference data set. This problem concerned the interpretation of the data. There might be a problem with the interpretation of (1) the HTML tags of a layout that surrounds the information, (2) the created extraction patterns, and (3) the extracted information. The HTML tag layout and the created patterns may suggest several correct forms of the same information. For example, we may consider the following situations:

  • there is an available layout of HTML tags <tag1><tag2>[information to extract]<tag3>[suffix associated with the information to extract]<tag4>. Based on these HTML tags, we may create two patterns {<tag1><tag2>(.+?)<tag3>, <tag1><tag2>(.+?)<tag4>}. Using these patterns, we may extract the following information: [information to extract] and [information to extract]<tag3>[suffix associated with the information to extract]. Using simple pre-processing, we may filter out the unnecessary HTML tags from the extracted information. This way, the correct form of the instance, i.e. [information to extract] [suffix associated with the information to extract], is obtained. This often occurs, for instance, when displaying the cast. Usually, additional information is added to the film role name. This information indicates whether an actor lent their own voice or not, e.g. barry “big bear” thorne and barry “big bear” thorne (voice), etc.,

  • there is an available layout of HTML tags <tag1>(.+?)<tag2>, which covers, for example, the name of an actor such as matthew perry i. However, for the given page the WI may create another pattern, which extracts similar semantic information, e.g. matthew perry,

  • there is also an available layout of HTML tags <tag1>(.+?)<tag2>, which covers, for example, film names in the following form: production year | english film title (1979 | Apocalypse Now). However, this layout may also cover only the prefix, i.e. the production year (1979), because there is no English version of the film title.

After consideration of the situations mentioned above, the author decided to add different variations of the same correct information to the designed reference data set. This means that the Wrappers induction and Pattern matching components should extract all possible forms of data that have been identified as correct. Furthermore, we may assume a simple pre-processing. This pre-processing removes the unnecessary HTML tags from the extracted data. Thus, the author assumes that, for example, matthew perry i and matthew perry, or apocalypse now, 1979 | apocalypse now and 1979, etc., extracted from the HTML documents are all correct.

5.2 The indicators to evaluate the proposed solutions

The following indicators were used for the evaluation of the proposed WI algorithm [25, 46, 67]:

  • Precision

    $$\begin{aligned} { Prec} = \frac{|V_{{ ref}} \cap V_{{ rec}}|}{|V_{{ rec}}|} \end{aligned}$$
    (5)
  • Recall

    $$\begin{aligned} { Rec} = \frac{|V_{{ ref}} \cap V_{{ rec}}|}{|V_{{ ref}}|} \end{aligned}$$
    (6)
  • F-measure

    $$\begin{aligned} F = \frac{2 \cdot { Prec} \cdot { Rec} }{ { Prec} + { Rec} } \end{aligned}$$
    (7)
  • Macro-average precision

    $$\begin{aligned} { Prec}_{\textit{mac-avg}} = \frac{ \sum _{k=1}^n { Prec}_{p_{k}}}{n} = \frac{1}{n} \sum \limits _{k=1}^n \frac{ \left| V_{{ ref}_{p_{k}}} \cap V_{{ rec}_{p_{k}}}\right| }{\left| V_{{ rec}_{p_{k}}}\right| } \end{aligned}$$
    (8)
  • Macro-average recall

    $$\begin{aligned} { Rec}_{\textit{mac-avg}} = \frac{ \sum _{k=1}^n { Rec}_{p_{k}}}{n} = \frac{1}{n} \sum \limits _{k=1}^n \frac{\left| V_{{ ref}_{p_{k}}} \cap V_{{ rec}_{p_{k}}}\right| }{\left| V_{{ ref}_{p_{k}}}\right| } \end{aligned}$$
    (9)
  • Macro-average F-measure

    $$\begin{aligned} F_{\textit{mac-avg}} = \frac{1}{n} \sum \limits _{k=1}^n \frac{2 \cdot { Prec}_{p_{k}} \cdot { Rec}_{p_{k}} }{ { Prec}_{p_{k}} + { Rec}_{p_{k}} } \end{aligned}$$
    (10)

where \(V_{{ ref}}\) is the set of reference instances (the set of reference attribute values for the given website) and \(V_{{ rec}}\) is the set of received instances (the set of received attribute values, i.e. the retrieved set, for the given website); \(|V_{{ ref}}|\) is the size of the set of reference instances and \(|V_{{ rec}}|\) is the size of the set of received instances; n is the number of web pages for the given website; \({ Prec}_{p_{k}}\) is the precision and \({ Rec}_{p_{k}}\) is the recall of the k-th document; \(V_{{ ref}_{p_{k}}}\) is the set of reference instances and \(V_{{ rec}_{p_{k}}}\) is the set of received instances of the k-th document; \(|V_{{ ref}_{p_{k}}}|\) is the size of the set of reference instances and \(|V_{{ rec}_{p_{k}}}|\) is the size of the set of received instances of the k-th document.
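The indicators translate directly into set operations; a short Python sketch (the helper names are illustrative):

  def prf(v_ref, v_rec):
      """Precision, recall and F-measure for one website (Eqs. 5-7)."""
      tp = len(v_ref & v_rec)
      prec = tp / len(v_rec) if v_rec else 0.0
      rec = tp / len(v_ref) if v_ref else 0.0
      f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
      return prec, rec, f

  def macro_average(pages):
      """Macro-averaged precision, recall and F over n documents (Eqs. 8-10)."""
      scores = [prf(v_ref, v_rec) for v_ref, v_rec in pages]
      n = len(scores)
      return tuple(sum(s[i] for s in scores) / n for i in range(3))

  # Example: two documents of one website, each with a reference and a received set.
  pages = [({"die hard", "x files"}, {"die hard"}),
           ({"district 9"}, {"district 9", "noise"})]
  print(macro_average(pages))  # (0.75, 0.75, 0.666...)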

5.3 The plan of the experiment

The author conducted experiments with different configurations of the components to evaluate the proposed algorithm/system in comparison with SEAL. Figure 13 presents the scheme of the experiment plan.

Fig. 13 The scheme of the experiment plan

In Fig. 13 the Seeds set component includes the elements (the configuration names) Set of seeds without semantic labels (S1) and Set of seeds with semantic labels (S2). The S1 set contains instances that belong to one general attribute, thing name. The S2 set contains instances that are split between more specific attributes (the taxonomy of seeds, see Sect. 4). The Pre-processing Domain’s web pages component includes the elements HTML tags with attributes and values (H1), HTML tags without the attribute values (H2), and HTML tags without attributes and their values (H3). These elements may or may not remove some parts of the HTML documents (see Sect. 4.2.1). The Matching component includes the elements Matching seeds to the whole HTML document (M1) and Matching seeds between HTML tags (M2). The M1 element matches each seed against the whole HTML document, including tag attributes; for example, the seed \({ seed}=x~{ files}\) matches inside the href attribute of <a href=“.../x files”>x files</a>. The M2 element matches each seed only to the text between HTML tags; for the same fragment, it matches only the occurrence of x files between <a ...> and </a>. The Wrapper induction component includes the elements Wrapper induction on chars level per document (W1), Wrapper induction on chars level per domain (W2), and Wrapper induction on HTML tags level per domain (W3). These elements induce the wrappers in three different ways.
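To make the difference between the M1 and M2 elements concrete, the following Python sketch contrasts the two matching strategies on a single HTML fragment; the seed, the fragment, and the helper names are illustrative assumptions, not the actual component interfaces:

import re

def match_m1(seed: str, html: str) -> bool:
    """M1: match the seed anywhere in the raw HTML document,
    including inside tag attributes such as href."""
    return seed in html

def match_m2(seed: str, html: str) -> bool:
    """M2: match the seed only in the text between HTML tags."""
    between_tags = re.findall(r">([^<]*)<", html)
    return any(seed in fragment for fragment in between_tags)

# Hypothetical fragment: the seed occurs only inside the href attribute.
html = '<a href=".../x files">x-files (tv series)</a>'
print(match_m1("x files", html))  # True: the seed is found inside the href attribute
print(match_m2("x files", html))  # False: the seed does not occur between the tags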

As shown in Fig. 13, we can conduct the following experiments: \({ Experiment}_1{:}~ S1 \rightarrow H1 \rightarrow M1 \rightarrow W1\), \({ Experiment}_2{:}~ S2 \rightarrow H1 \rightarrow M1 \rightarrow W1\), \({ Experiment}_3{:}~ S1 \rightarrow H1 \rightarrow M1 \rightarrow W2\), \({ Experiment}_4{:}~ S2 \rightarrow H1 \rightarrow M1 \rightarrow W2\), \({ Experiment}_5{:}~ S1 \rightarrow H1\rightarrow M2\rightarrow W2\), \({ Experiment}_6{:}~ S2\rightarrow H1\rightarrow M2 \rightarrow W2\), \({ Experiment}_7{:}~ S1 \rightarrow H2 \rightarrow M2 \rightarrow W2\), \({ Experiment}_8{:}~ S2 \rightarrow H2 \rightarrow M2 \rightarrow W2\), \({ Experiment}_9{:}~ S1 \rightarrow H3 \rightarrow M2 \rightarrow W2\), \({ Experiment}_{10}{:}~ S2 \rightarrow H3 \rightarrow M2 \rightarrow W2\), \({ Experiment}_{11}{:}~ S1 \rightarrow H1 \rightarrow M2 \rightarrow W3\), \({ Experiment}_{12}{:}~ S2 \rightarrow H1 \rightarrow M2 \rightarrow W3\), \({ Experiment}_{13}{:}~ S1 \rightarrow H2 \rightarrow M2 \rightarrow W3\), \({ Experiment}_{14}{:}~ S2 \rightarrow H2 \rightarrow M2 \rightarrow W3\), \({ Experiment}_{15}{:}~ S1 \rightarrow H3 \rightarrow M2 \rightarrow W3\) and \({ Experiment}_{16}{:}~ S2 \rightarrow H3 \rightarrow M2 \rightarrow W3\).

\({ Experiment}_1\) and \({ Experiment}_2\) test the SEAL algorithm (only the wrapper phase, without the ranking phase). \({ Experiment}_3\)–\({ Experiment}_{10}\) test the BigGrams system working on the chars level. \({ Experiment}_{11}\)–\({ Experiment}_{16}\) test the BigGrams system working on the HTML tags level.

5.4 The realization of the experiment plan and the results

Section 5.4.1 presents the results of the experiment where the BigGrams system works in the batch mode. Section 5.4.2 shows the results of the experiment where the BigGrams system works in the boosting mode.

5.4.1 The batch mode

The evaluation of the proposed information extraction system is based on the comparison of the set of reference attribute values \(V_{{ ref}}\) with the set of received attribute values \(V_{{ rec}}\) produced by this system. The author evaluated each of the three domains mentioned above. For the SEAL algorithm, the author changed the value of the numberOfChars attribute from 1 to 9 and noted the best result. For the BigGrams system, the author changed the support inter concept parameter from 0.1 to 0.9 in steps of 0.1. The author set the parameters \({ minNumberOfLeftChars} = 2\), \({ minNumberOfRightChars} = 12\), and \({ minNumberOfLeftChars} = 4\), \({ minNumberOfRightChars} = 4\), respectively, for the BigGrams system in the chars level mode, and \({ minNumberOfLeftHtmlTags} = 1\), \(\textit{minNumberOfRightHtmlTags} = 1\), and \({ minNumberOfLeftHtmlTags} = 2\), \(\textit{minNumberOfRightHtmlTags} = 2\), respectively, for the BigGrams system in the HTML tags level mode. Also, the author assumed the constant parameters \({ support~concept} = 0.1\) and \({ filtered~outlier~seed} = { true}\). The author created data sets with the following numbers of seeds \(|S_{{ input}}|\): filmweb.pl \(|S_{{ input}}| = \sum _{a \in A} |V_{a}| = 1020\) seeds (Table 4 shows the used attributes \(a \in A\)); ptaki.info \(|S_{{ input}}| = |V_{a={ latin}\text{- }{} { bird}\text{- }{} { name}}| + |V_{a={ polish}\text{- }{} { bird}\text{- }{} { name}}| = 6 + 6 = 12\) seeds; and agatameble.pl \(|S_{{ input}}| = |V_{a={ product}\text{- }{} { name}}| = 65\) seeds.
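The selection of the best BigGrams configuration can be pictured as a simple grid search over the support inter concept parameter. The sketch below is only an assumption about how such a sweep might be organized; run_biggrams and evaluate are hypothetical placeholders standing in for the actual system and the indicators of Sect. 5.2:

def sweep_support_inter_concept(domain, seeds, run_biggrams, evaluate):
    """Try support inter concept = 0.1, 0.2, ..., 0.9 and keep the best F-measure.
    run_biggrams and evaluate are hypothetical placeholders, not real APIs."""
    best = None
    for i in range(1, 10):
        support_inter_concept = round(0.1 * i, 1)
        extracted = run_biggrams(
            domain,
            seeds,
            support_concept=0.1,                         # kept constant in the experiment
            support_inter_concept=support_inter_concept,
            filtered_outlier_seed=True,
        )
        prec, rec, f = evaluate(extracted, domain)
        if best is None or f > best[0]:
            best = (f, prec, rec, support_inter_concept)
    return best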

Tables 2 and 3 contain the best results achieved in the experiments. These tables compare the results of the SEAL system (Experiments 1–2) with those of the BigGrams system working on the chars level (Experiments 3–10) and on the HTML tags level (Experiments 11–16), with different configurations of the components. Section 5.3 (Fig. 13) explains the whole research plan in detail.

Table 2 The comparison of the results from SEAL (SEAL’—the chars level and the seed set, SEAL”—the chars level and the taxonomy of seeds) and the BigGrams system (BigGrams*—the chars level with different configurations, BigGrams**—the HTML tags level with different configurations) for the filmweb.pl and ptaki.info domains; s is the standard deviation
Table 3 The comparison of the results from SEAL (SEAL’—the chars level and the seed set, SEAL”—the chars level and the taxonomy of seeds) and the BigGrams system (BigGrams*—the chars level with different configurations, BigGrams**—the HTML tags level with different configurations) for the agatameble.pl domain; s is the standard deviation

Based on the results included in Tables 2 and 3, we may conclude that the proposed algorithm works worst when the WI uses HTML tags with attributes and their values. Conversely, it works best when the WI uses HTML tags with attributes but without their values. The HTML tag granularity significantly improves the proposed WI solution. Thanks to using the more complex structure of seeds (the taxonomy of seeds) rather than a bag of seeds, we can achieve better results for a more complex domain, i.e. higher values of the F-measure indicators. SEAL is not an appropriate solution for direct extraction of the important instances (keywords) from a domain.

Based on all the conducted experiments, some parameter values of the algorithm can be determined. The maximum values of the indicators for the filmweb.pl domain were obtained using the following input parameters of the algorithm: \(|S_{{ input}}| = 1020\), \(\textit{minNumberOfLeftHtmlTags} = 2\), \(\textit{minNumberOfRightHtmlTags} = 2\), \({ support~concept} = 0.1\), \({ support~inter~concept} = 0.2\), \({ filtered~attributes~values} = { true}\) and \({ filtered~outlier~seed} = { true}\). For these parameters, the author obtained the following indicator values: \({ Prec} = 0.9948\), \({ Rec} = 0.9603\), \(F = 0.9773\), \({ Prec}_{{ mac}\text{- }{} { avg}} = 0.9738 \pm 0.1565\), \({ Rec}_{{ mac}\text{- }{} { avg}} = 0.9362 \pm 0.1733\) and \(F_{{ mac}\text{- }{} { avg}} = 0.9523 \pm 0.1613\). In addition, for these parameters the \(V_{{ ref}_{p_{k}}}\) and \(V_{{ rec}_{p_{k}}}\) sets were compared per page. Figure 14 shows this comparison.

Fig. 14 The per-page comparison of the reference instances set \(V_{{ ref}_{p_{k}}}\) with the set of received instances \(V_{{ rec}_{p_{k}}}\) (the top plot), and boxplots of precision (\({ Prec}_{p_k}\)), recall (\({ Rec}_{p_k}\)) and F-measure (\(F\hbox {-}{} { measure}_{p_k}\)) values in terms of median (the bottom plot)

Figure 14 (the top plot) shows the per-page comparison of the \(V_{{ ref}_{p_{k}}}\) set with the \(V_{{ rec}_{p_{k}}}\) set. This figure shows an almost perfect overlap between these two data sets. Due to this fact, the values of the evaluation indicators are high. Moreover, the experiment shows that parameter values can be chosen such that the algorithm produces an almost perfect overlap, i.e. \(|V_{{ ref}_{p_{k}}}| \cong |V_{{ rec}_{p_{k}}}|\). Furthermore, Fig. 14 (the bottom plot) shows the boxplots of precision (\({ Prec}_{p_k}\)), recall (\({ Rec}_{p_k}\)) and F-measure (\(F\hbox {-}{} { measure}_{p_k}\)) values in terms of median. We may see that (1) almost all values of each indicator are close to 1 and (2) there are a few outlier points that increase the value of the standard deviation (s).

5.4.2 The boosting mode

The author conducted two experiments for the boosting mode. Both experiments assumed two things: (1) we have the output results of the information extraction, and (2) we can feed these results back to the input of the BigGrams system.
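Under these two assumptions, the boosting mode can be sketched as a simple feedback loop in which the instances extracted for each attribute are added to the seed taxonomy before the next run. In the Python sketch below, run_biggrams and verify are hypothetical placeholders rather than the actual system components:

def boosting(taxonomy, run_biggrams, iterations=7, verify=None):
    """taxonomy: dict mapping each attribute to its set of seed instances.
    run_biggrams and verify are hypothetical placeholders, not real APIs."""
    for _ in range(iterations):
        extracted = run_biggrams(taxonomy)        # dict: attribute -> newly extracted instances
        for attribute, instances in extracted.items():
            if verify is not None:
                # perfect boosting: keep only semantically valid instances
                instances = {v for v in instances if verify(attribute, v)}
            # non-perfect boosting: all instances are fed back unfiltered
            taxonomy[attribute] |= instances
    return taxonomy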

In the first case, the author assumed that the BigGrams system can perfectly extract new values for each attribute \(a \in A\) of the taxonomy of seeds, i.e. the extracted values are semantically related to the attributes; for example, the BigGrams system will retrieve only true new names of actors for the \({ actor~names}\) attribute. The author called this experiment the perfect boosting. The author created seven taxonomies with different numbers of input seeds \(|V_{a}|\). In each iteration, each taxonomy was set as the input of the BigGrams system. Table 4 presents the results obtained in this experiment.

In the second case, the author assumed that the BigGrams system may extract imperfect new values for each attribute \(a \in A\) of the taxonomy of seeds, i.e. the extracted values may not be semantically related to the attributes. For example, the BigGrams system may retrieve new false values, such as bmw or \(x~{ files}\), for the \({ actor~names}\) attribute. The author called this experiment the non-perfect boosting. The author created three initial taxonomies with different numbers of input seeds \(|V_{a}|\). In each iteration, each taxonomy was extended by the newly extracted instances, and the extended taxonomy was set as the input of the BigGrams system. Table 5 presents the results obtained in this experiment.

The author used only the \({ filmweb.pl}\) domain in these tests, since this domain is built on a complex taxonomy of seeds (six different attributes) and has a complex layout. The \({ agatameble.pl}\) domain has only one attribute; the \({ ptaki.info}\) domain has two attributes but a simple layout, and only twelve seeds are required to achieve the maximum values of the indicators. In the experiment, the author employed exactly the same configuration that previously produced the best results in the batch mode.

Table 4 The results of the experiment: the perfect boosting

Table 4 presents the results of the experiment where the perfect boosting was used. In this case, in each iteration, the BigGrams system created valid input data sets based on its output. Thus, the maximum values of the indicators are achieved in an incremental fashion (after at most 7 iterations). Furthermore, in each iteration each \(V_a\) set is extended with new instances and the indicator values increase.

Table 5 The results of the experiment: the non-perfect boosting

Table 5 presents the results of the experiment where the non-perfect boosting was used. As we can see, the values of the indicators saturate too quickly, and after some iterations they decrease. This occurs because a generic pattern is created despite the separation of the film title attribute into the input attributes polish film title and english film title. This pattern extracts both kinds of titles. As a result, in the next iteration, the algorithm loses the ability to create general patterns and keeps rediscovering the same information. In addition, the algorithm begins to emit noise (false values, in the case of 55 seeds). As a result, the algorithm cannot diversify the patterns or extract new seeds; it overfits the data. A review of the output data and the indicators showed that large values of recall (0.9–0.95) corresponded to lower values of precision (0.8–0.95). Thus, the algorithm extracts new values that are not present in the reference set.

6 Conclusion

The most important findings of this work are as follows:

  • the empirical research shows that the described techniques allow us to improve the WI and achieve high-quality output results,

  • the empirical research shows that the quality of information extraction depends on (1) the form of the input data, (2) the pre-processing of the domain’s web pages, (3) the matching technique, and (4) the level of HTML document representation (the granularity of HTML tags),

  • the worst results are obtained when the HTML tags contain attributes and their values. In this case, the algorithm creates very detailed patterns with a low degree of generalization,

  • the best results are achieved when the proposed taxonomy approach is used as the input of the WI algorithm, the pre-processing technique clears the values of HTML attributes, the seeds are matched only between HTML tags, and the HTML documents are represented at the tags level rather than the chars level. With this configuration, the WI creates generic patterns covering most of the expected instances,

  • if we can ensure well-diversified input data, the WI may be used in the boosting mode,

  • the weak assumption that patterns built from seeds belonging to semantic classes will extract new, semantically consistent instances is useful, but only partly correct. Adopting this assumption in the first iteration of the proposed algorithm produces good results,

  • the BigGrams system is suitable for extracting relevant keywords from Internet domains.

During the evaluation phase, the author received a set of new instances that coincides with the set of reference instances. We should remember that the newly extracted instances have not been evaluated in terms of semantics. However, as shown by the experiments based on boosting without semantic verification of the instances, the next iterations of the algorithm may worsen the initial results: the created wrappers generate instances of different semantic classes. For this reason, the author intends to add an automatic mechanism for determining the semantics of the new instances (a Verification component). The experiments involving the perfect boosting yielded promising results.

The presented BigGrams system has achieved promising experimental results. It seems that the method still has some potential and allows further optimization. The algorithm still has four parameters (minNumberOfLeftHtmlTags, minNumberOfRightHtmlTags, supportConcept, supportInterConcept) that give an opportunity to control the results (precision, recall, F-measure). In the next optimization step of the algorithm, the author wants to reduce the number of these parameters or determine their values automatically. So far, the author has conducted one experiment in this direction, which confirmed that such an algorithm can be constructed. An initial model based only on the parameters minNumberOfLeftHtmlTags and minNumberOfRightHtmlTags is able to induce wrappers that, for the most complex domain filmweb.pl, give results worse by 5% in terms of F-measure. This issue will be further researched and developed.