Data modeling and NLP-based scoring method to assess the relevance of environmental regulatory announcements

The constantly growing body of global environmental legislation necessitates that corporate environmental compliance managers frequently assess the relevance of new regulations and regulation revisions for each of their sites. Companies are pressured to streamline and automate this crucial task through digital workflows and specialized IT-based assistance systems. This has recently piqued the interest of researchers working in different disciplines, such as intelligent systems, machine learning, and natural language processing. The article describes the latest results of our long-term research program on IT-based support for corporate compliance management, offering insights for these and other disciplines. The context and the main aspects of environmental regulation announcements and the relevance assessment task are analyzed. An extensive conceptual data model is developed that serves as a foundation for tailoring a generic method to perform a relevance assessment that considers site-specific individual environmental compliance facts. The method uses heuristic data operations and various text processing techniques from the field of natural language understanding. In order to exemplify the method, two application scenarios are described in which the relevance of new waste management directives is assessed for a multi-site production company.


Introduction
Announcements of new entities of environmental legislation, such as laws, acts, and directives, referred to in the following as 'environmental regulations' or just 'regulations' and announcements of revisions of already existing regulations must be continuously monitored. Among other corporate environmental compliance management (CECM) duties, this monitoring task is a central obligation for all business organizations. Whenever a new regulation or revision is announced, the relevance for the firm must be assessed. Any relevant regulation and its revisions need to be documented together with respective enforcement measures for auditing purposes and compliance checks. Companies are recommended to make use of a central regulation registry (Thimm 2015) to streamline operational compliance management tasks. The key data administered in a regulation registry are regulations, revisions, the results of relevance assessments of regulatory announcements, and measures that target enforcing compliance.
The accurate assessment of regulatory announcements and the provision of an up-to-date regulation registry can be perceived as a crucial requirement (Campbell Gemmell and Marian Scott 2013) that needs to be fulfilled to achieve full compliance with environmental legislation. Clearly, a firm's environmental compliance situation must be viewed as a continuous time-varying state (Thimm 2017b). Transitions from a (positive) compliance state into a noncompliance state where environmental legislation is violated may occur for many different reasons, including breakdowns of compliance enforcement measures, malfunctioning of infrastructure and equipment, human errors, organizational deficiencies, limited expertise, limited trust in the governmental compliance enforcement systems, sabotage acts, and environmental crime (White and Heckenberg 2012). Note in this context that any revision of product properties, production processes, material logistic routines, and the equipment involved needs to be carefully evaluated in terms of potential compliance conflicts.
A recent research report of Good Jobs First, a nonprofit organization based in Washington, DC, (Mattera and Baggaley 2021) p. 6, describes the following: 'Over the past two decades, state regulatory agencies and attorneys have generally brought more than 50,000 successful enforcement actions against private sector entities for violations of clean air, clean water, and other environmental laws. Looking at cases with penalties of $5000 or more, the states have collected about $21 billion in fines, settlements, and other payments.' In the context of public debates of such numbers, industry associations often draw attention to the regulation density and the dynamic of environmental legislation, which has been growing dramatically during the last decades. It is argued that this trend, which is expected to become even stronger in future, has led to a highly complex and heterogeneous body of environmental legislation that imposes severe problems on the business world. In particular, multinational companies with many different production sites, supply chain partners, and customers around the world are already challenged today and even more in the near future by global environmental legislation, which is constantly being extended and revised by many different rule setters at various levels, including municipalities, counties, states, countries, and supra-national organizations.
Difficulties in finding people with suitable skills for the complex set of CECM tasks and budget constraints are reasons that some companies have outsourced or out-tasked the monitoring and relevance assessment of regulatory announcements. Typical contract partners in practice are law firms, environmental consultancies, and highly specialized software companies that offer curated environmental legislation content.
Regardless of whether the announcement monitoring and the relevance assessment are performed in-house by corporate environmental compliance managers or completed by contractors, IT-based assistance for these tasks may make it easier for firms to handle the challenges described above (Thimm 2017a). The research described in this article aims to investigate IT-based assistance for announcement monitoring and relevance assessments and to pioneer and test respective assistance tools. We do not target what is often referred to in legal informatics as 'legal machines' (Cyras and Lachmayer 2014), but our approach bears some commonalities to these works, particularly regarding the aspect of legal subsumption.
A generic data model is presented that combines both the modeling of domain knowledge and the modeling of the specific CECM context of firms. The data model is intended to serve as the foundation for a generic assistance system that computes relevance scores for new regulations. A second obvious building block of the targeted assistance system is a tailored scoring method that computes accurate relevance scores for the firm's sites that may need to comply with different sets of environmental regulations. This requires the method to carefully address the actual activity profile and the prevalent regulatory situation for each site. A first version of such a scoring method is proposed on the basis of the data model. The method combines intuitive analysis steps and analysis steps that apply standard text analysis techniques from the field of natural language processing (NLP). Two application scenarios are described in order to exemplify the principles of the method and to demonstrate its validity. In each scenario it is assumed that a new waste management directive has to be assessed for a multi-site production company. The sample data used to describe the assessment steps of the method are largely based on the real-world CECM data provided by an industry partner. The article proceeds as follows. Related work and an overview of the text analysis techniques considered in this work are described in Sect. 2 and Sect. 3, respectively. The main aspects of the regulatory announcements are investigated in Sect. 4. The data model is introduced in Sect. 5. An overview of the proposed scoring method is given in Sect. 6. Also, in Sect. 6 the application scenarios are briefly described, while corresponding detail data are given in the appendix. Concluding remarks are contained in Sect. 7.

Related work
Butler (2011) published theories to explain how green information systems can support organizational sense making, decision-making, and knowledge sharing. The work includes a conceptual model of the process of regulatory compliance gathering and the process of compliance decision-making. Several similarities between the Butler model and the concepts proposed in this research can be found. Butler looks at the processes as a whole, abstracting from implementation aspects. In contrast, this research investigates specific decision tasks of these processes and describes a solution approach for implementation.
In an Irish case study (Butler and McGovern 2012), the company Napa Inc. was analyzed concerning major CECM issues. The researchers explored the fundamental compliance processes and challenges that firms face in their CECM practice in general, and in particular, with respect to the use of ICT for CECM tasks. The study is focused on product compliance, while this research targets compliance across all business functions, including manufacturing, logistics, supply chain management, and the interference of a firm's physical work space with the environment. Additionally, generic information system (IS) solutions for CECM are analyzed and a process-based conceptual model for CECM and a conceptual architecture of a CECM IS called the 'compliance knowledge management system' are described. Similar aspects of CECM research have been addressed by a research group at Pforzheim University (Thimm 2015). The group proposed a comprehensive process model and an information system approach for CECM. The more recent work of the same group discussed a conceptual framework for cloud-based assistance of CECM practitioners (Thimm 2018).
A reference model for an environmental management information system for compliance management that makes comprehensive use of business intelligence concepts has been proposed by Freundlieb and Teuteberg (2009). Kerrigan (2003) of Stanford University proposed a software infrastructure that offered assistance for CECM tasks based on semantic technologies. A research group of IBM developed an approach for compliance automation through the use of event monitoring rules (Giblin et al. 2006). Wizards for conveying environmental information and helping people complete environmental management tasks are described, for example, in Braun et al. (2004).
We also searched for related work in the research literature on business process management and decision modeling. The results of these search efforts suggest that the CECM field has not been at the focus of business process management research thus far. Most articles address corporate sustainability management at a strategic level and thus, focus on higher-level perspectives, such as the business case level (Schaltegger et al. 2012) or the business model level (Geissdoerfer et al. 2018). It appears that work on decision modeling for the domain of corporate sustainability management has largely not addressed issues of the CECM field.
In recent research studies, advanced NLP methods and machine learning have been applied to extract specific knowledge items from text documents of the construction domain, such as contractual risk clauses (Moon et al. 2022), requirements (Hassan and Le 2020), and legal and contractual matters (Hassan et al. 2021). The results of these studies may offer promising research avenues for the computer-based relevance assessment of environmental regulations targeted in this work. However, one needs to consider that several fundamental differences exist between the text documents of the two domains. Therefore, it cannot be expected that the extraction methods for construction documents will provide reasonable results for documents of the environmental compliance management domain when just the ontology or dictionary used is tailored to the new domain.
NLP and AI in general have already been applied in the legal domain for several decades (Dale 2019). However, today's large interest of researchers and practitioners in what is known as LegalTech was mainly caused by recent advancements in AI. According to Haney (2019, p. 3), 'Today, NLP is the most commonly used method of AI in the practice of law.' In a recent journal article, Dale (2019) defines the following five areas of legal activity where NLP is playing an increasing role: legal research, electronic discovery, contract review, document automation, and legal advice. The particular CECM tasks that this research aims to support with NLP most likely belong to the area of 'legal research,' characterized by 'finding information relevant to a legal decision,' and to the area of 'electronic discovery,' characterized by 'determining the relevance of documents to an information request.' However, only a few works can be found where the application of different NLP methods in these fields was studied.

The text analysis methods considered
Today, there exists a broad variety of different text analysis methods (Anandarajan et al. 2019; Bird et al. 2009; Gudivada and Rao 2018). In recent years, the traditional methods mostly developed by the NLP and the Information Retrieval research communities have been complemented by new methods that address what is often characterized as 'Big Data' through the use of machine learning approaches, especially the use of deep learning (Ghavami 2020). For the proposed relevance scoring method, four traditional NLP text analysis methods have been chosen that partially build on each other. In the following, a brief overview of these methods is given.
Keyword Frequency Analysis (KFA). As the name suggests, this method computes the frequencies of words or phrases as they appear in a text. The method often serves as the basic building block of higher-level text analysis methods (Illinois University Library 2022). However, for some use cases, the raw word counts or percentage numbers for words may already reveal useful insights.
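To make the KFA idea concrete, the following minimal Python sketch counts raw and relative frequencies of selected keywords. It is an illustration only and not part of the proposed method: the function name keyword_frequencies, the simple regular-expression tokenizer, and the sample sentence are assumptions, and the sketch handles single-word keywords only (no phrases, lemmatization, or stemming).

```python
import re
from collections import Counter


def keyword_frequencies(text, keywords):
    """Return {keyword: (raw count, share of all tokens)} for single-word keywords.

    Hypothetical helper for illustration; tokenization is a plain lowercase
    split on alphabetic runs.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        k: (counts[k], counts[k] / total if total else 0.0)
        for k in keywords
    }


# Invented sample sentence, not taken from a real regulation.
sample = (
    "Waste water must be treated. "
    "Untreated waste may not be discharged into water bodies."
)
freqs = keyword_frequencies(sample, ["waste", "water", "sewage"])
for word, (count, share) in freqs.items():
    print(f"{word}: count={count}, share={share:.3f}")
```

Even such raw counts can hint at the topic of a text; the higher-level methods described next build on them.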
Named Entity Recognition (NER). It is the objective of this method to recognize and extract specific types of entities, such as names of people, organizations, machine elements, locations, times, quantities, monetary values, percentages, and more in a text (Foley et al. 2018; Jagota 2020). Typical use cases of the method locate entities to pull specific information from a text, to discover the subject of a given text, and to discover relationships among the entities. The method has also been used to improve web searches and document indexing and to support building an ontology. In these and other use cases, an NER analysis serves as a first preprocessing step, which is followed by other specialized text analysis processing steps.
Different NER algorithms have been developed. Dictionary-based algorithms use dictionaries of values of every entity type that is to be recognized. One of the main drawbacks of dictionary-based approaches is that they cannot effectively handle ambiguity. Algorithms that use probabilistic dictionaries particularly address the ambiguity of words. Pattern-based algorithms use regular expressions. These algorithms are most applicable when the targeted entities are best described by structural patterns. Several further traditional non-dictionary-based NER algorithms exist, such as rule-based algorithms and machine learning-based NER algorithms, which require large, annotated training data. Compared with the classical dictionary-based algorithms, which simply scan the text for hits, these algorithms are usually far more complex. In some NER applications, a combination of multiple approaches has been used (Keretna et al. 2014).
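The difference between the dictionary-based and the pattern-based approach can be sketched in a few lines of Python. The sketch is a simplified illustration under stated assumptions: the small authority dictionary, the regular expression for numeric limits, the function names, and the sample sentence are all hypothetical and are not taken from the dictionaries used in this research.

```python
import re

# Hypothetical mini-dictionary: surface term -> canonical entity label.
AUTHORITY_DICT = {
    "Environmental Protection Agency": "EPA",
    "EPA": "EPA",
    "European Environmental Agency": "EEA",
    "EEA": "EEA",
}

# Hypothetical pattern for numeric limits followed by a unit.
LIMIT_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg/l|°C|%)")


def find_dictionary_entities(text, dictionary):
    """Dictionary-based step: scan the text for every known term (whole words)."""
    hits = []
    for term, label in dictionary.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            hits.append((match.start(), term, label))
    return sorted(hits)


def find_limit_entities(text):
    """Pattern-based step: extract (value, unit) pairs described by a structure."""
    return [(m.group(1), m.group(2)) for m in LIMIT_PATTERN.finditer(text)]


# Invented sample sentence for illustration only.
text = (
    "The EPA sets a lead limit of 0.015 mg/l; "
    "the discharge temperature may not exceed 30 °C."
)
entities = find_dictionary_entities(text, AUTHORITY_DICT)
limits = find_limit_entities(text)
print(entities)
print(limits)
```

The word 'lead' in the sample sentence also hints at the ambiguity problem mentioned above: a plain dictionary lookup for the substance 'lead' could not distinguish it from the verb.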
Document Subject Identification (DSI). Various techniques to identify the main subject(s) of a single document are described in the literature. According to D'Hondt, these techniques can be divided into two categories (D'hondt et al. 2011) p. 3784: '… techniques using statistical information extraction techniques and those exploiting lexical cohesion.' For both categories, algorithms have been proposed that are based on machine learning approaches. The 'latent Dirichlet allocation' (LDA) algorithm is a frequently used algorithm that explores subject probabilities from available statistical data.
In this work, we focus on DSI methods that exploit lexical cohesion without the use of machine learning techniques through a combination of a dictionary-based NER analysis and statistical methods, such as cluster analysis. It has been argued that the results of dictionary-based DSI algorithms are dependent on the semantic resources available for a specific text; therefore, the setup is limited to the text (D'hondt et al. 2011). However, this drawback can be at least partially overcome through the use of a comprehensive and well-developed dictionary. A methodology for building dictionaries was proposed by a research group at Carleton University (Deng et al. 2019). This methodology suggests obtaining an initial version of the intended dictionary from an existing context-specific text corpus by applying an NER analysis.
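A strongly simplified way to picture dictionary-based subject identification is to count dictionary hits per subject area and rank the areas by their share of all hits. The Python sketch below does exactly that; it replaces the cluster analysis mentioned above with plain hit counting, so it should be read as a stand-in and not as the DSI procedure of this work. The grouping of terms into areas, the function name identify_subjects, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical assignment of dictionary terms to legislation areas.
AREA_TERMS = {
    "water protection": ["waste water", "effluent", "sewage"],
    "air pollution": ["waste air", "particle content"],
}


def identify_subjects(text, area_terms):
    """Rank areas by their share of dictionary hits (simple stand-in for DSI)."""
    lowered = text.lower()
    hits = Counter()
    for area, terms in area_terms.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            hits[area] += len(re.findall(pattern, lowered))
    total = sum(hits.values())
    if total == 0:
        return []  # no dictionary term found, subject stays undetermined
    return [(area, n / total) for area, n in hits.most_common() if n > 0]


# Invented sample text for illustration only.
sample = (
    "Effluent and sewage from the plant count as waste water. "
    "The particle content of waste air is monitored."
)
subjects = identify_subjects(sample, AREA_TERMS)
print(subjects)
```

The sketch also illustrates the dependency noted above: a text whose vocabulary is not covered by the dictionary yields no subject at all.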
Document Similarity Analysis (DSA). The goal of DSA is to measure the pairwise similarity between text documents (Elia 2020). Corresponding techniques can be divided into methods that work on a lexical level (i.e., surface closeness of two text instances), meaning that they use only the words in the text, and methods that go beyond that and measure semantic similarity (i.e., similarity of meaning). Methods to measure semantic similarity attempt to explore the actual meaning behind words or the entire phrase in context. Clearly, this is a far more difficult task than measuring lexical similarity. At its current stage, this research focuses on similarity scores that measure only the lexical similarity between documents. One of the earliest techniques to compute such scores is the vector space model (Shajalal and Aono 2019). Methods that are built on this model compute similarity scores in two steps. First, the documents are transformed into a vector representation, and then a similarity score is computed using a vector distance calculation formula. 'Term Frequency-Inverse Document Frequency' (TF-IDF) vectors are frequently used in the vectorization step of many implementations of this method (Neto 2021). For the similarity score calculation, the methods often use either the cosine similarity, which in general has a value ranging from −1 to 1 (and from 0 to 1 for TF-IDF vectors, whose components are non-negative), or the Euclidean distance. Other distance metrics are Jaccard, Manhattan, and Minkowski.
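The two steps of the vector space model, vectorization and distance calculation, can be sketched with the Python standard library alone. The sketch assumes one common smoothed variant of the IDF weight and a plain alphabetic tokenizer; the function names and the three sample texts are hypothetical, and production implementations typically rely on an established library instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (no stemming or stop words)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Step 1: turn each document into a sparse TF-IDF vector (dict term -> weight)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF; one of several common variants, chosen here as an assumption.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_similarity(u, v):
    """Step 2: cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented sample texts for illustration only.
docs = [
    "Hazardous waste must be stored in sealed containers.",
    "Sealed containers are required for hazardous waste storage.",
    "Noise emissions at night are limited near residential areas.",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(f"doc0 vs doc1: {sim_01:.3f}")
print(f"doc0 vs doc2: {sim_02:.3f}")
```

Because the first and third sample texts share no token, their lexical similarity is zero even though both describe environmental obligations, which illustrates the limitation of purely lexical measures.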

Environmental regulatory announcements
Today's growing body of environmental legislation consists of laws, acts, ordinances, statutory commands, treaties, subordinances, and other forms of environmental obligations for the business world (Campbell Gemmell and Marian Scott, 2013; German Environment Agency 2019; Ruhl 1997) that we subsume in the following by the notion of an 'environmental regulation' or just a 'regulation.' One of the main drivers of today's strong environmental regulation dynamics is the climate change action plan of the United Nations Organization (UNO), which encourages politicians around the world to tighten environmental laws. In general, the empowerment of environmental authorities to issue regulations is usually limited to a particular territory. Examples of territories of authorities given in a hierarchical order are a city, a county, a state, a country, and the territory of a supra-national union of countries, such as the territory of the European Union.
When a decision for a new regulation or for a revision of an already established regulation has been made, usually the greater public is informed by the authority through a regulatory announcement. Typically, the announcements are text documents that contain a copy of the authority's original regulation text or of the revision text. Additionally, announcement documents may contain metadata about the authority, metadata and general data about the regulation, and context-specific background information. The announcement documents are published through various channels, such as the internet, print media, special governmental media, online databases, special information agencies, and service providers. Some special service providers have the provision of a curated database of announcement documents as their business model along with explanations, guidelines, and recommendations for corporate compliance managers.
Clearly, announcements should be made in a timely fashion to give firms enough time to complete checks concerning the relevance for the firm and, when needed, to react through respective compliance enforcement measures. As of today, there does not exist a common format, nomenclature, or common language style for regulatory announcements. Abstract sentences with many technical terms and references to entities of the current body of environmental legislation are frequently used to describe new regulations and revisions of existing regulations. Therefore, much experience and effort are required to obtain the criteria that are to be checked to know if a regulation applies to a firm. The task of obtaining this knowledge through a corresponding investigation bears some similarities to what lawyers refer to as 'legal characterization' or 'legal subsumption.' A description of the theoretical foundation of the notion of subsumption, for example, is available in (Cyras and Lachmayer, 2014).
When regulatory announcements are explored for a relevance assessment, those items of the CECM work field that are addressed by the regulation and are prevalent in the business practice of the firm need to be investigated. We refer to these investigation items by 'items of CECM concern' (Thimm 2022). Table 1 contains several simple examples for this concept. It can be expected that in the CECM practice of typical multi-site production companies, many of the items of CECM concern account for the firms' Scope 1 and Scope 2 greenhouse gas emissions of the Greenhouse Gas Protocol (WBCSD and WRI, 2004).
To explore a new regulation or revision, we ask the following questions: What general field(s) of environmental law is (are) being addressed? What is the spatial/geographical boundary of the regulation? What particular business aspects (e.g., product properties, aspects of the production method, material usage) are being addressed? Which types of items of CECM concern are the focus of the regulation? What (pollution) limits (e.g., waste water temperature) are addressed? What conditions for exceptions are described? Which other entities of environmental legislation are related to the regulation and should be considered in the investigation? The answers to each of the questions need to be put into the context of the firm. The particular company situation with respect to contextualized versions of the above questions needs to be explored. Additionally, sample questions to ask include the following: Does the firm fit to the particular spatial/geographical scope of the regulation? Do the business activities of the company include particular entities addressed by the regulation? Do the particular entities satisfy the specific constraints set for the type of entities?
A relevant regulatory announcement may require compliance enforcement measures that are targeted at the particular business aspect and type of item of CECM concern explored in the relevance assessment. Table 1 describes some simple examples of measures and measure categories that may be taken into consideration for particular types of items of CECM concern. Note that regulatory announcements require firms to perform relevance assessments of the environmental legislation. Changes in the firm's business model, business processes, production processes, product portfolio, growth activities, and other movements to a new status quo may affect business aspects that are relevant for the work field of CECM. Firms are further obligated to assess these changes in terms of their conformance with the relevant environmental legislation.
Additionally, another relevant fact of this research is that companies are expected to maintain a (digital) regulation registry also known as regulation cadaster (Thimm 2015). The registry has to store all regulations and the respective relevance assessment results together with potential measures taken to enforce compliance. Clearly, a regulation registry (especially in digital form) is one of the most essential tools for effective environmental compliance management and other tasks of corporate environmental management, such as audit and inspection management and permit management. Additionally, annual environmental and sustainability reports, in addition to other information, are usually composed of data stored in the regulation registry, such as the number of environmental incidents, compliance violations (Thimm 2019), measures, and savings obtained by the measures.

The data model
In general, conceptual data modeling (Robinson et al. 2015) is a discipline where a particular application domain or 'universe of discourse' is modeled, for example, as a founding step of a database development project. Conceptual models based on the well-known Entity Relationship Modeling (ERM) method address two main concepts at the intentional level: entity types depicted in Entity Relationship Diagrams (ERD) as labeled boxes and relationship types depicted as labeled rhombuses. The properties of both concepts are also addressed in the ERM method and are displayed in ERDs as labeled circles. In ERDs, the maximum number of relationship instances in which entities can participate is indicated through respective cardinality numbers that are associated with the corresponding edges.
For the domain of CECM, the researchers devised a conceptual data model using the ERM method with cardinality information given in the classical Chen notation (Chen 1976). In this notation, n and m stand for two distinct numbers with arbitrary positive integer values. The resulting ERD in Fig. 1 is displayed in a version that abstracts from the properties in order to give a first overview of the complex data model. The full version of the model that includes the properties is contained in Fig. 2. All major concepts, issues, and definitions described in the previous section with respect to the relevance assessment CECM tasks are addressed in the model. The gray parts of the ERD are abstractions for company-specific aspects that are to be considered in the relevance assessment task. The ERD elements in blue address the domain knowledge required for this CECM task. Among these elements are the entity types (depicted in light blue color): 'Legislation Area Term,' 'Item Type Term,' and 'Authority Term,' which model the terms of three dictionaries. As described in the next section, these three dictionaries serve the NLP-processing steps of the proposed relevance assessment method.
The company-specific model part consists of the three entity types labeled 'Company,' 'Site,' and 'Item of CECM Concern.' Clearly, 'Company' models a company and 'Site' models an individual site of a company. The headquarters of a company (i.e., main place where it is registered) is modeled through the attribute 'site category.' Companies can consist of multiple sites, which is modeled by respective cardinalities of the relationship type 'consists of.' The relationship type 'belongs to municipality' models, in general, the association of sites with governmental structures. Through this relationship type, the model focusses on the smallest relevant 'governmental administration unit,' which usually is the municipality (German Environment Agency 2019). Note that the containment of municipalities in larger 'governmental administration units' (e.g., states, countries, supra-national organizations) is also addressed in the CECM knowledge part of the model. The relationship type 'has relevant legislation area' models the fact that some areas of environmental legislation are relevant for the site, while others are not. Whether an area of environmental legislation is relevant for a site largely depends on what is going on at the site (e.g., production site, storage sites, development sites, administration) (German Environment Agency 2019). The two relationship types with the identical name, 'have assessed relevance for,' model the site-specific relevance of regulations and revisions.
Of the modeled CECM domain knowledge, the entity type, 'Geographical Area,' models geographical territories relevant to environmental legislation. In addition, some territories may be contained in other larger territories, which are addressed by the unary relationship type, 'belongs to next larger area.' 'Legislation Area' models the regulation areas that are considered areas of environmental legislation, such as 'water protection,' 'waste,' 'chemicals,' and 'air pollution' (German Environment Agency 2019). Dictionary terms addressed by the entity type, 'Legislation Area Term,' are, for example, 'waste water,' 'effluent,' 'sewage,' 'lead concentration,' 'particle content,' and 'leakage.' 'CECM Item Type' models aspects through which companies interfere with the environment, such as waste water, waste air, concrete waste, air pollution, and resource consumption (Thimm 2022). Also, the same type is used to model aspects that impose risks for the environment, such as explosive composites, hazardous material, leakages, water drain, and storm weather conditions. Respective dictionary terms modeled by the entity type 'Item Type Term' are terms frequently used for the contextualization of regulations and the detailed specification of limitations, threshold values, and critical values. Sample terms that refer to waste water are 'natural river discharge,' 'temperature limit,' and 'monitoring obligation.' Sample terms that refer to hazardous chemical substances are 'arsenic,' 'lead,' 'benzene,' 'chromium,' and 'toluene.' Clearly, the entity type, 'Authority,' models authorities that are entitled to issue environmental regulations and revisions (German Environment Agency 2019). Dictionary terms used in regulatory announcements to refer to these authorities are addressed by the entity type, 'Authority Term.' 
Examples of such terms include 'Environmental Protection Agency,' 'EPA,' 'European Environmental Agency,' 'EEA,' 'German Federal Environmental Agency,' 'UBA,' and 'Istituto Superiore per la Protezione e la Ricerca Ambientale,' 'ISPRA.' In general, these authorities issue regulations and revisions of regulations that are addressed in the model through the entity types 'Regulation' and 'Revision.' Furthermore, some regulations set limits, for example, on exposure levels and content shares, which is addressed by the attribute 'sets limit(s)' of the type 'Regulation.' The attribute 'announcement text,' modeled for the types 'Regulation' and 'Revision,' serves as a proxy for the respective announcement document. The binary relationship type 'updates' models the fact that revisions update existing regulations. The relationship type 'refers to' stipulates that a revision may refer to multiple regulations and, vice versa, a regulation may refer to multiple revisions. Revisions that lead to changes of earlier revisions of regulations are modeled by the unary relationship type 'updates.' Both regulations and revisions refer to items of CECM concern, which is expressed by the two identical relationship types 'refers to CECM Item.' The geographical scope of regulations and revisions is modeled by the relationship type 'has spatial scope.' The regulatory scope is modeled by the relationship type 'has regulatory scope.' Regulations may refer to other regulations of the environmental legislation, which is modeled through the unary relationship type 'refers to.'
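As a rough illustration of how a small part of the conceptual model could be carried over into code, the following Python sketch represents a few of the entity types as data classes. It is a simplified assumption and not the implementation of the repository: only a subset of entity types is covered, relationship types are reduced to plain attributes, and all attribute names and sample values are chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GeographicalArea:
    name: str
    # Unary relationship 'belongs to next larger area'.
    next_larger_area: Optional["GeographicalArea"] = None


@dataclass
class Regulation:
    title: str
    announcement_text: str            # proxy for the announcement document
    spatial_scope: GeographicalArea   # 'has spatial scope'
    regulatory_scope: List[str] = field(default_factory=list)  # legislation areas
    sets_limits: bool = False         # attribute 'sets limit(s)'


@dataclass
class Site:
    name: str
    site_category: str                # e.g. headquarters, production site
    municipality: GeographicalArea    # 'belongs to municipality'
    relevant_legislation_areas: List[str] = field(default_factory=list)
    # 'have assessed relevance for': regulation title -> relevance score.
    assessed_relevance: Dict[str, float] = field(default_factory=dict)


@dataclass
class Company:
    name: str
    sites: List[Site] = field(default_factory=list)  # 'consists of'


# Hypothetical sample data.
state = GeographicalArea("State X")
town = GeographicalArea("Town A", next_larger_area=state)
plant = Site("Plant 1", "production site", town, ["waste"])
company = Company("Example Corp", [plant])
directive = Regulation("Sample waste directive", "(announcement text)", state, ["waste"])
plant.assessed_relevance[directive.title] = 1.4
print(company)
```

In a database-backed repository, the many-to-many relationship types of the ERD would be realized as separate association tables instead of nested attributes.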

Toward an NLP-based relevance assessment method
On the basis of the conceptual data model described above, the researchers developed a first version of a relevance assessment method that builds on the possibilities of today's text processing technology. In particular, the NLP methods described in Sect. 3 are applied. The proposed assessment method computes site-specific numeric scores for a (new) regulation. A score indicates to what extent the regulation is relevant for the specific site. In principle, relevance scores for revisions of regulations can be computed through exactly the same steps. The relevance scores are obtained from heuristic rules that are derived from both intuition and observations of CECM practitioners. The rules are implemented in a multistep scoring scheme that investigates the content of regulations and CECM data at the firm level and at the firm site level. The data are contained in a central repository that is based on the conceptual data model described in Sect. 5 and is referred to in the following as the 'Environmental Compliance Knowledge and Data Repository' or, in short, 'EKDR.'

Scheme of method and a 'Toy Example'
The proposed scheme performs a relevance assessment for a new regulation document in several steps that are grouped into two subsequent phases: an initial scoping phase that is followed by a scoring phase. The variables used to describe the scheme are defined in Table 2. The scoping phase obtains the geographical and regulatory scope of the regulation in an initial analysis step. Then, the firm sites that are within the scope of the new regulation are identified and further investigated in the scoring phase. In the scoring phase, the CECM-specific context of the site is explored. This includes an analysis of the already registered regulations and of the registered items of CECM concern. An aggregated numeric relevance assessment score rs_i ∈ [0; 2] is obtained for every site. Consequently, the interval [0; 2] serves as the scoring scale. A score of 0 means that the regulation is not relevant at all, whereas the maximum score of 2 indicates that the regulation is definitely of utmost relevance for the site. Scores above 0.8 are considered to indicate that the regulation is relevant.
The following description of the two phases is focused on the general high-level algorithm, the integrated NLP methods, and the data of the EKDR repository used by the processing steps. The data analyses are performed by functions that are defined in the following list. The document with the particular new regulation is passed to each function through the parameter <regdoc>. Note that the given informal definitions abstract from the common NLP preprocessing steps, such as lemmatization and stemming (Anandarajan et al. 2019).
• A function denoted by WCT(<regdoc>) is defined that computes the word count of a regulation document denoted by <regdoc>.
• A function denoted by DSI-AUTHORITY(<regdoc>, <authority_dict>) is defined that identifies the authority that issued the new regulation. The function performs a dictionary-based NER analysis using a term dictionary denoted by <authority_dict>. The terms identify common authorities that act as environmental rule setters. When several authorities are extracted, statistical operations are used to identify the correct issuer of the regulation.
• A function denoted by DSI-LEG-AREA(<regdoc>, <legarea_dict>) is defined that identifies the particular environmental legislation area addressed by the new regulation. Using a dictionary that contains terms to identify regulation areas denoted by <legarea_dict>, an NER analysis is performed. The final selection from several extracted regulation areas is made through statistical operations.
• A function denoted by DSA-REG(<regdoc>, <scored_regdoc>) is defined that computes a text similarity score xs_{i,q} ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The text similarity score measures the pairwise lexical similarity between the document <regdoc> and a relevant already scored document denoted by <scored_regdoc>. The function compares the documents and, based on a linear value assignment model, returns the lowest score value of 0 when no similarity is found. Otherwise, a larger score is returned with the score value being derived from the extent of similarity. That is, a score value of 10 indicates maximum similarity.
• A function denoted by TCA-CECM(<regdoc>, <cecm_dict>) is defined that searches the document <regdoc> for terms contained in the term dictionary denoted by <cecm_dict>. The function results in the set TC_i = {TC_{i,1}, TC_{i,2}, …, TC_{i,w}} that contains the w terms found in the document.
• A function denoted by IQR-WCO(<regcollection>) is defined that determines the frequency distribution of the word counts of a set of regulation documents denoted by <regcollection>. The function returns the values of the lower quartile, the middle quartile (i.e., the median), and the upper quartile denoted by wq1, wq2, wq3 ∈ ℕ. Following the general definitions of the statistics discipline, the value of wq1 implies that 25% of the set of documents have a word count lower than or equal to wq1. Likewise, 50% of the documents have a word count lower than or equal to wq2, and 75% of the documents have a word count lower than or equal to wq3.

Table 2 Variables used to describe the scheme
…: A relevance assessment score of company site S_i concerning a particular R_new
xs_{i,q} ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}: A text similarity score computed by function DSA-REG which measures the pairwise lexical similarity between the textual content of two regulation documents; score xs_{i,q} is specific to site S_i
X_i = {xs_{i,1}, xs_{i,2}, …, xs_{i,v}}: A set of q = 1, …, v text similarity scores xs_{i,q} obtained for company site S_i
ss_i ∈ [0; 1]: A similarity score of company site S_i
cs_i ∈ [0; 1]: A normalized coverage score of company site S_i
cŝ_i ∈ [0; 4]: A calculative coverage score of company site S_i
rs_i ∈ [0; 2]: A relevance score of company site S_i
hsR_i ⊂ R_i, lsR_i ⊂ R_i: Two disjoint sets of regulations already assessed for site S_i, with set hsR_i containing regulations with a high similarity to R_new and set lsR_i containing regulations with a low similarity to R_new
hts_i, lts_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}: Text similarity scores, with hts_i obtained from the high scores contained in set hsR_i and lts_i obtained from the low scores contained in set lsR_i
T_i = {T_{i,1}, T_{i,2}, …, T_{i,k}}: A set of d = 1, …, k technical terms T_{i,d} that describe the regulatory context of a particular site S_i for a specific regulation area
TC_i = {TC_{i,1}, TC_{i,2}, …, TC_{i,h}}: A particular set of c = 1, …, h technical terms TC_{i,c} obtained by the function TCA-CECM
w ∈ ℕ: Word count value of a particular regulation document obtained by function WCT
wq1_i, wq2_i, wq3_i ∈ ℕ: Values of the three lower quartiles computed by the function IQR-WCO for the word count frequencies of the set of regulation documents that are already assessed for site S_i
s_i ∈ {0.25, 0.5, 0.75}: Scaling factor used to calculate a coverage score relative to both the length of the new regulation R_new and the length of the regulations already assessed for site S_i
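To make the informal definitions of WCT, TCA-CECM, and IQR-WCO more tangible, they can be sketched with the Python standard library alone. The sketch rests on assumptions that are not part of the definitions above: words are tokenized with a simple regular expression, dictionary terms are matched case-insensitively as whole phrases, no lemmatization or stemming is applied, and the quartiles are computed with the 'inclusive' method of `statistics.quantiles`. The function names are merely Python spellings of the names used in the list.

```python
import re
import statistics


def wct(regdoc: str) -> int:
    """WCT: word count of a regulation document (regex tokenization, an assumption)."""
    return len(re.findall(r"\w+", regdoc))


def tca_cecm(regdoc: str, cecm_dict: set[str]) -> set[str]:
    """TCA-CECM: the dictionary terms that occur in the document."""
    text = regdoc.lower()
    return {
        term
        for term in cecm_dict
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    }


def iqr_wco(regcollection: list[str]) -> tuple[int, int, int]:
    """IQR-WCO: lower quartile, median, and upper quartile of the word counts.

    Needs at least two documents; the quartile method is an assumption.
    """
    counts = [wct(doc) for doc in regcollection]
    wq1, wq2, wq3 = statistics.quantiles(counts, n=4, method="inclusive")
    return round(wq1), round(wq2), round(wq3)
```

A production implementation would add the preprocessing steps from which the definitions abstract, so that inflected forms of a dictionary term are also found.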
Scoping phase. The scoping phase is performed through the following four steps:
(1) Obtain the word count w of the regulation R_new through function WCT(<regdoc>).
(2) Check the new regulation R_new for common metadata patterns used to specify the authority. When the document does not contain proper metadata, perform function DSI-AUTHORITY(<regdoc>, <authority_dict>). Retrieve from the EKDR repository the geographical area for which the identified authority sets regulations. Proceed with the retrieved geographical area being used as the geographical scope of the regulation.
(3) Check the document for metadata patterns used to specify the regulation area. When no proper metadata are found, perform function DSI-LEG-AREA(<regdoc>, <legarea_dict>). Continue with the recognized regulation area being used as the regulatory scope of the regulation.
(4) Retrieve the particular set of company sites S = {S_1, S_2, …, S_k} from the EKDR repository where each site S_i (i = 1, …, k) is within the geographical and the regulatory scope of the regulation.
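The scoping steps above can be sketched as follows. The sketch makes several assumptions that go beyond the description: the metadata check is omitted, the unspecified 'statistical operations' are replaced by a simple frequency count of matched dictionary terms, the geographical scope is reduced to a single country name, and the repository is stood in for by plain Python dictionaries. All names, dictionary entries, and the sample document are hypothetical; the five sites are those of the toy example described later in this section.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    country: str                      # geographical location of the site
    regulation_areas: frozenset[str]  # regulation areas relevant for the site


def most_frequent_entry(regdoc: str, term_dict: dict[str, str]) -> str | None:
    """Dictionary lookup: return the entry whose terms are matched most often."""
    text = regdoc.lower()
    hits: Counter[str] = Counter()
    for term, entry in term_dict.items():
        n = len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", text))
        if n:
            hits[entry] += n
    return hits.most_common(1)[0][0] if hits else None


def sites_in_scope(sites: list[Site], geo_scope: str, reg_scope: str) -> list[Site]:
    """Step (4): keep the sites within the geographical and regulatory scope."""
    return [s for s in sites
            if s.country == geo_scope and reg_scope in s.regulation_areas]


# Hypothetical stand-ins for repository content
authority_dict = {"umweltbundesamt": "UBA", "uba": "UBA", "epa": "EPA"}
legarea_dict = {"waste": "waste management", "end-of-life": "waste management",
                "emission": "air quality"}
authority_area = {"UBA": "Germany", "EPA": "United States"}
waste = frozenset({"waste management"})
sites = [Site("S1", "Germany", waste), Site("S2", "Germany", waste),
         Site("S3", "Germany", waste), Site("S4", "Italy", waste),
         Site("S5", "Austria", waste)]

regdoc = "Issued by the Umweltbundesamt (UBA): handling of end-of-life wood waste."
issuer = most_frequent_entry(regdoc, authority_dict)   # step (2)
area = most_frequent_entry(regdoc, legarea_dict)       # step (3)
in_scope = sites_in_scope(sites, authority_area[issuer], area)
```

For an authority that sets regulations for several countries, the geographical scope would have to be a set of areas rather than a single name.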
Scoring phase. In this phase, the k sites of the set of sites S are assigned relevance assessment scores as_i ∈ [0; 2]. The scores are obtained through the formula as_i = ss_i + cs_i. The component score ss_i ∈ [0; 1], referred to as the similarity score, is obtained through a similarity analysis. The component score cs_i ∈ [0; 1], referred to as the coverage score, is obtained through a coverage analysis. The intuitions behind these analyses and their principal processing steps to determine the scores are described in the following.
Similarity analysis. The rationale behind the similarity analysis is that when many similar regulations can be found that are relevant for the site, then the new regulation R_new is also likely to be relevant. On the basis of this rationale, an algorithm has been devised that consists of the following five steps, which yield a similarity score ss_i for a particular site S_i with respect to R_new:
(1) Of the regulations already assessed for site S_i, obtain from the EKDR repository the particular set of regulations that have the same regulatory scope as R_new. Filter from this set the set of j = 1, …, g regulations R_i = {R_{i,1}, R_{i,2}, …, R_{i,g}} that have been assessed to be relevant for site S_i.
(2) Perform a pairwise comparison between each regulation of the set R_i and R_new through function DSA-REG(<regdoc>, <scored_regdoc>) to obtain a corresponding set of q = 1, …, v text similarity scores X_i = {xs_{i,1}, xs_{i,2}, …, xs_{i,v}}.
(3) Use the text similarity scores of set X_i to partition the set R_i into two subsets:
• a subset hsR_i ⊂ R_i that contains the j = 1, 2, …, m (with m ≤ v) regulations R_{i,j} for which a relatively high similarity to R_new was determined, such that xs_{i,j} ≥ 4 for all R_{i,j} ∈ hsR_i
• a subset lsR_i ⊂ R_i that contains the p = 1, 2, …, n (with n ≤ v and n + m = v) regulations R_{i,p} for which a relatively low similarity or no similarity to R_new was determined, such that xs_{i,p} < 4 for all R_{i,p} ∈ lsR_i
(4) From set hsR_i obtain a specific score, referred to as the high text similarity score or just high score, denoted by hts_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. When hsR_i = {}, use a high score of hts_i = 0. Otherwise, assign hts_i the rounded mean score of the elements of set hsR_i. Likewise, obtain from set lsR_i a specific score, referred to as the low text similarity score or just low score, denoted by lts_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Apply the rule used to obtain the high score in order to obtain the low score lts_i from the set lsR_i.
(5) For R_new compute the site-specific similarity score ss_i ∈ [0; 1] from the high score hts_i and the low score lts_i through the following formula:
Coverage analysis. Recall from above that the EKDR repository contains the regulatory context of a site S_i, which is given by the CECM items that are associated with S_i. The coverage analysis focuses exactly on these items. The rationale behind the analysis is that when a relatively large number of a site's CECM items is contained in the description of the new regulation, then the new regulation is likely to be relevant. The total word count of the new regulation serves as the point of reference to obtain an appropriate relative measure concerning the regulation's number of CECM items. Using this rationale as the guiding principle, an algorithm of five steps has been developed. It computes a coverage score cs_i for a new regulation R_new and a particular site S_i as follows:
(1) Obtain from the EKDR repository the set of d = 1, …, k technical terms denoted by T_i = {T_{i,1}, T_{i,2}, …, T_{i,k}} that describe the regulatory context of S_i regarding the particular regulation area of R_new.
(2) Perform function TCA-CECM(<regdoc>, <cecm_dict>) with the regulation R_new and the set T_i being used as actual parameters. The function results in the set of c = 1, …, h terms TC_i = {TC_{i,1}, TC_{i,2}, …, TC_{i,h}}, with each of the h terms being contained in both the set T_i and the regulation R_new.
(3) Perform function IQR-WCO(<regcollection>) with the set of regulations R_i obtained in the first step of the similarity analysis being used as actual parameter. This yields the quartile values wq1_i, wq2_i, and wq3_i of the word count frequency distribution of R_i. Apply the following rule to obtain from the result a scaling factor s_i ∈ {0.25, 0.5, 0.75} that indicates how the word count w of R_new compares to the word count frequency of R_i: According to this rule, a small scaling factor s_i is used when many regulations of R_i exceed the word count w of R_new. Conversely, a large scaling factor s_i is used when many regulations of R_i have a word count that is lower than w.
(4) For the given site S_i, obtain a coverage score cs_i ∈ [0; 1] that is specific to regulation R_new using the cardinality of the term set TC_i, denoted by |TC_i|, and the cardinality of the term set T_i, denoted by |T_i|. When |T_i| = 0, use a coverage score of cs_i = 0. When |T_i| ≠ 0, obtain cs_i in two steps. First, obtain a calculative coverage score cŝ_i ∈ [0; 4] through the following formula that uses the scaling factor s_i: In a second step, apply the following transformation to obtain the targeted (normalized) coverage score cs_i from the calculative coverage score cŝ_i: cs_i = cŝ_i if cŝ_i ≤ 1; otherwise cs_i = 1.
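Parts of the scoring phase can be sketched in a few lines of Python. The partition at a similarity score of 4, the treatment of empty sets, the special case |T_i| = 0, and the capping of the coverage score at 1 follow the description above. Everything else is an assumption made only for illustration: the 'rounded mean' is rounded half up, the rule in `scaling_factor` merely mirrors the verbal description (short documents get a small factor), and the expression in `coverage_score` is a placeholder that respects the stated range [0; 4]. Neither the formula for ss_i nor the original rule and formula for s_i and cŝ_i are reproduced here.

```python
import statistics

HIGH_SIMILARITY_THRESHOLD = 4  # xs >= 4 counts as high similarity (from the text)


def rounded_mean(scores: list[int]) -> int:
    """Mean rounded half up; 0 for an empty set (rounding mode is an assumption)."""
    return int(statistics.mean(scores) + 0.5) if scores else 0


def high_low_scores(xs_scores: list[int]) -> tuple[int, int]:
    """Steps (3) and (4) of the similarity analysis: partition, then hts and lts."""
    high = [x for x in xs_scores if x >= HIGH_SIMILARITY_THRESHOLD]
    low = [x for x in xs_scores if x < HIGH_SIMILARITY_THRESHOLD]
    return rounded_mean(high), rounded_mean(low)


def scaling_factor(w: int, wq1: int, wq3: int) -> float:
    """ASSUMED rule: compare the word count w with the lower and upper quartile."""
    if w <= wq1:       # most assessed regulations are longer than the new one
        return 0.25
    if w <= wq3:
        return 0.5
    return 0.75        # most assessed regulations are shorter than the new one


def coverage_score(n_found: int, n_terms: int, s: float) -> float:
    """Normalized coverage score; n_found = |TC_i|, n_terms = |T_i|."""
    if n_terms == 0:
        return 0.0
    calculative = n_found / (s * n_terms)  # ASSUMED placeholder with range [0; 4]
    return min(calculative, 1.0)           # cap at 1, as described in the text
```

With the placeholder expression, a short regulation that mentions a quarter of a site's terms already reaches the maximum coverage score, which is one possible reading of using the word count as the point of reference.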
Practical 'Toy Example' to demonstrate the principles of the method. In the following, the scheme of the method described above is exemplified through two fictitious application scenarios. The scenarios are derived from experience and from some real data of an industry partner, a German production company with multiple sites in Europe. The details of each scenario are described in the tables contained in the appendix.
The example assumes a German manufacturing company that produces a set of diverse semi-finished goods for the furniture industry and the household appliances industry at three sites in Germany (Heilbronn S1, Karlsruhe S2, Bochum S3), one site in Italy (Turin S4), and one site in Austria (Orth S5). Over the years, 69 regulations and revisions concerning the area of waste management have already been assessed by the CECM specialists for each of the German sites. About the same number of regulations and revisions for waste management have also been assessed for the other two sites. The scenarios focus on environmental legislation for waste management and assume that a new regulation is set by the German Environmental Agency ('Umweltbundesamt'). The first scenario exemplifies the major steps of the scheme in order to obtain relevance assessment scores for a new waste management regulation concerning the handling of end-of-life wood. In the second scenario, the assessment steps are performed for a new regulation concerning the handling of end-of-life vehicles.
Note that the numbers used for the scenarios are obtained from respective real-world regulations and investigations of the regulation registry of the industry partner. In both scenarios, the regulatory scope of the regulation is explored in the initial scoping phase, and the set of sites to be further investigated through the scheme's scoring phase is narrowed down to the three German sites S1, S2, and S3. Below, the major results of the scoring phase and the final result are described for these three sites for each scenario.
Assessment of regulation concerning handling of end-of-life wood. For the two production sites S1 and S3 that produce semi-finished goods for the furniture industry, similarity scores of ss1 = 0.6 and ss3 = 0.7 are obtained. These scores reflect the fact that, due to their focus, site S1 and site S3 naturally deal with wood, and therefore the method obtains a relatively large number of relevant regulations similar to the regulation for handling end-of-life wood. The relatively low score of ss2 = 0.1 obtained for site S2 results from the site's specific production focus on goods for household appliances, which logically implies a lower number of relevant similar regulations (|R2| = 8). That the new regulation is of much higher relevance for the sites S1 and S3 than it is for site S2 is also indicated by the respective coverage scores cs1 = 0.96 and cs3 = 1. These scores result from a comparison of the terms that describe the CECM-specific aspects of the sites with the content of the regulation document. The resulting aggregated relevance assessment scores rs1 = 1.56 and rs3 = 1.7 indicate that the new regulation on handling of end-of-life wood is of relevance for site S1 and site S3 but not of relevance for the sites S2, S4, and S5.
Assessment of regulation concerning handling of end-of-life vehicles. Obviously, the regulation on handling end-of-life vehicles addresses matters that are largely not of relevance for the sites S1, S2, and S3. Hence, the similarity scores ss1 = 0.1, ss2 = 0, and ss3 = 0.2 of the three sites are close to the minimal score of 0. The same holds true for the sites' coverage scores cs1 = 0.11, cs2 = 0, and cs3 = 0.29. Consequently, the regulation is not of relevance for any of the three sites, which is indicated by the respective relevance assessment scores rs1 = 0.21, rs2 = 0, and rs3 = 0.49.
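The aggregation in both scenarios can be retraced with a few lines of Python. The sketch uses only the component scores reported above, adds them as described for the scoring phase, and applies the threshold of 0.8 stated for the scoring scale. The function and variable names are hypothetical. Site S2 is left out of the first scenario because its coverage score is not reported in the text.

```python
RELEVANCE_THRESHOLD = 0.8  # scores above 0.8 indicate relevance (from the text)


def relevance(ss: float, cs: float) -> tuple[float, bool]:
    """Aggregate similarity and coverage score and apply the threshold."""
    rs = round(ss + cs, 2)  # rounding only hides floating-point noise
    return rs, rs > RELEVANCE_THRESHOLD


# (similarity score, coverage score) per site, as reported in the two scenarios
wood = {"S1": (0.6, 0.96), "S3": (0.7, 1.0)}
vehicles = {"S1": (0.1, 0.11), "S2": (0.0, 0.0), "S3": (0.2, 0.29)}

wood_result = {site: relevance(ss, cs) for site, (ss, cs) in wood.items()}
vehicles_result = {site: relevance(ss, cs) for site, (ss, cs) in vehicles.items()}

print(wood_result)      # {'S1': (1.56, True), 'S3': (1.7, True)}
print(vehicles_result)  # {'S1': (0.21, False), 'S2': (0.0, False), 'S3': (0.49, False)}
```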

Domain knowledge and company data: acquisition and curation
The proposed assessment method computes relevance scores based on domain knowledge and company-specific data. In various processing steps, the required items are retrieved from the EKDR data repository, which is derived from the data model described in Sect. 5. In the following, the major challenges and issues of the acquisition, the population, and the maintenance of these items of domain knowledge and company-specific data are discussed.
Dictionaries for the text analyses. The relevance assessment method builds on three term dictionaries that are used to extract information from a new regulation document. The dictionaries <authority_dict> and <legarea_dict> are used to extract the authority and the regulation area, respectively. The dictionary <cecm_dict> is used to explore items of CECM concern. In general, there are a number of alternatives to obtain initial versions of the term collections. First, NER analyses can be used to compute the term collections from a suitable document subset of an environmental law text corpus, such as the EUR-Lex text collection of documents about European Union law (European Union 2022). A second alternative is to use a suitable subset of the terms specified in environmental reporting standards, such as GRI (GRI 2022) or CDP (CDP 2022). The third alternative is to select a proper subset of the topics defined by the EUROVOC topic hierarchy, which contains almost 4000 categories concerning different aspects of European law (Filtz et al. 2019). The initial term collections obtained through one of these alternatives or a combination of the alternatives are to be tested and optimized by CECM experts during trial runs of the assistance system. To keep the dictionaries up to date and accurate, their content also needs to be curated by the CECM experts from time to time.
Company-specific data. Clearly, the method's potential capability to compute company-specific relevance scores builds on processible data that specify the CECM context of the firm. Acquiring knowledge about all relevant data items and specifying these data items in a corresponding repository may seem to require substantial effort, especially for large companies with many sites. However, much of the data may already be available in existing systems, such as corporate environmental management information systems, environmental, health and safety (EHS) systems, regulation cadasters, and other systems of the corporate information system landscape, including ERP systems, plant management information systems, facility management systems, process control systems, product lifecycle management systems, warehouse management information systems, energy management systems, and manufacturing execution systems. In some companies, CECM data items may even be contained in digital twins for products, production processes, and factories. Achieving an efficient extraction of relevant CECM data items from these systems through existing data exports and transformation functions is part of the next research steps. CECM data items may also be extracted from existing documents, such as material safety data sheets, product specification documents, dangerous goods specification documents, application documents and permits from environmental authorities, and special documents required when chemical substances are involved (e.g., REACH documents). The use of NLP methods to extract CECM data items from these documents seems to be a promising approach. Clearly, the (initial) company-specific data of the repository need to accurately reflect the current CECM circumstances of the firm. Hence, when changes occur, respective new items of CECM concern need to be inserted into the repository, and already existing items need to be updated or deleted.

Forthcoming method evaluation and future improvements
At the current stage of this ongoing research, a prototype of a relevance assessment assistance system is being implemented in the Python programming language. Various open source NLP packages, such as spaCy and NLTK, are used to benefit from their general functions for text preprocessing tasks (e.g., tokenization, stemming, and lemmatization) and their general functions for NER analysis and clustering. Some guiding principles for the implementation of tailored text processing algorithms based on standard components are adopted from the subject identification method proposed by Jamil et al. (2017).
The prototype system is used to evaluate both the proposed relevance assessment method as a whole and each of the NLP-based functions. It can be expected that the insights from the evaluation will lead to revisions. The future revised method may use more advanced NLP techniques, including elements of the BERT framework that builds on machine learning techniques. Because of the multilanguage context of the application domain, a bilingual approach may contribute to a better performance of the targeted relevance assessment method. Guidance for an extension toward measuring semantic textual similarity with bilingual word-level capabilities, for example, is given by the work of Shajalal and Aono (2019).
Prototype-based method evaluation using a real-world dataset. The EKDR data repository of the demonstrator is populated with domain knowledge accumulated by the researchers from the scholarly literature and from a decade of intensive collaboration with researchers around the world, consulting companies, software vendors, and authorities specializing in the field of environmental compliance management. For the repository population with company-specific data, we will use a dataset provided by our industry partner, a globally operating mid-sized production company in Germany. The company's production sites include two sites in Germany at which pressed parts, components, and automation solutions for the automotive industry and other industries are produced. We received a copy of the company's regulation cadaster in August 2021. This dataset contains more than one thousand regulations and approximately 400 revisions. Additionally, the cadaster stores the relevance assessments performed by the company's CECM experts and by an external consultant. Furthermore, it contains data about scheduled and already implemented measures to enforce compliance with new regulations and revisions. The initial dataset is complemented by further company-specific data to be obtained through interviews with the company's environmental management department and an external consulting company. The respective announcement documents for the regulations and revisions will be downloaded from the corresponding official websites. The first version of the EKDR data repository will be populated with German terms and, to some degree, corresponding English terms.
Three alternatives to acquire suitable initial term collections (i.e., dictionaries) for the text analysis methods are described above. For the first prototype, the dictionaries will be obtained by performing an NER analysis with a suitable collection of German text documents selected from the document body of German environmental law and from scholarly documents. The extraction of terms from a specific classification hierarchy as targeted by the other alternatives is considered for later evaluation phases.
However, only part of the entire company-specific set of regulations and revisions will be populated in the initial EKDR repository. Following the general advice of machine learning practitioners, the dataset will be split into three sets: a basic system setup dataset, a validation dataset, and an evaluation dataset. The setup dataset will consist of approximately 600 assessed regulations that will be populated in the EKDR repository. Approximately 200 announcement documents will be used for a first validation of the method's accuracy. For every document, i.e., new regulation, the relevance scores for the two sites of the company will be computed with the method and compared to the relevance assessment of the CECM experts. Through this comparison, possible incorrect assessment scores of the method can be revealed. The insights obtained from the first validation can be used to improve and calibrate the method and most likely also the content of the EKDR data repository to improve the method's accuracy. In a subsequent phase, the revised method will be evaluated and possibly improved again based on the evaluation dataset of approximately 200 further announcement documents assessed by the CECM experts of the company.
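The planned three-way split and the comparison with the expert assessments can be illustrated as follows. This is a minimal sketch under assumptions that the text does not make: the documents are assigned to the three sets by a seeded random shuffle, an expert assessment is reduced to a yes/no label per document and site, and the method's score counts as 'relevant' when it exceeds the threshold of 0.8. All function and parameter names are hypothetical.

```python
import random


def split_dataset(doc_ids, n_setup=600, n_validation=200, seed=0):
    """Split document ids into setup, validation, and evaluation sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    setup = ids[:n_setup]
    validation = ids[n_setup:n_setup + n_validation]
    evaluation = ids[n_setup + n_validation:]
    return setup, validation, evaluation


def agreement_rate(method_scores, expert_labels, threshold=0.8):
    """Share of documents for which method and experts agree on relevance."""
    matches = sum(
        (method_scores[doc] > threshold) == expert_labels[doc]
        for doc in expert_labels
    )
    return matches / len(expert_labels)
```

For example, `split_dataset(range(1000))` returns sets of 600, 200, and 200 document ids, and documents on which `agreement_rate` reports a mismatch are the candidates for calibrating the method or the repository content.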
Considerations for future improvements of the method. The validation and evaluation of the method will most likely reveal opportunities for improvements. It is also expected that the forthcoming comparison of the method with knowledge extraction methods recently proposed for the construction engineering domain (Hassan et al. 2021; Le, 2020, 2022; Moon et al. 2022) will yield optimization options. Several opportunities for improvements that are worth further investigation have already been identified in the present state of the research project. First, it is expected that the accuracy of the method can be improved through the use of dictionaries that are generated from existing classification hierarchies. In particular, we will investigate options to obtain dictionaries from the classification system of the environmental reporting standard GRI (GRI 2022) and from the EUROVOC topic hierarchy (Filtz et al. 2019), which addresses European law documents. A second objective of our future research is to investigate how the curation of the content could be automated through a mechanism that may involve interactions with expert users. A third improvement option concerns the text similarity analysis of the method. In the initial version of the method, the similarity analysis only considers regulations that have the same regulatory scope as the new regulation. The similarity analysis could be extended to also consider the regulations that are referred to by the regulations that have the same scope. This approach to improving the method's accuracy and further approaches explored in the evaluation phase will be investigated in our future research. We also plan to add a component that supplies the user with an explanation of the resulting score (i.e., a scoring report) and analytical capabilities to obtain insights about the scoring steps.

Conclusion
Intelligent assistance systems are already in use or are being developed for many different domains. However, it seems that today, there is still little interest in the research community and the software industry in inventing and studying assistance systems for corporate environmental compliance management. With the relevance assessment method and the underlying data model, core building blocks for a novel CECM assistance system are proposed in this work. It is assumed that, in particular, a cloud-based approach in which the domain knowledge and the company-specific data are shared by a set of assistance tools may enable companies to effectively and efficiently perform environmental compliance management duties and to prevent accidental breaches of environmental laws.

Author contributions
The author confirms sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation.
Funding
Open Access funding enabled and organized by Projekt DEAL. The author declares that no funds, grants, or other support were received during the preparation of this manuscript.

Data availability
The datasets generated during and/or analyzed during the current study are not publicly available due to individual privacy concerns but are available from the corresponding author on reasonable request.

Conflict of interest
The author has no relevant financial or non-financial interests to disclose.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.