Measuring and Modeling the U.S. Regulatory Ecosystem

Over the last 23 years, the U.S. Securities and Exchange Commission has required over 34,000 companies to file over 165,000 annual reports. These reports, the so-called “Form 10-Ks,” contain a characterization of a company’s financial performance and its risks, including the regulatory environment in which a company operates. In this paper, we analyze over 4.5 million references to U.S. Federal Acts and Agencies contained within these reports to measure the regulatory ecosystem, in which companies are organisms inhabiting a regulatory environment. While individuals across the political, economic, and academic world frequently refer to trends in this regulatory ecosystem, far less attention has been paid to supporting such claims with large-scale, longitudinal data. In this paper, in addition to positing a model of regulatory ecosystems, we document an increase in the regulatory energy per filing, i.e., a warming “temperature.” We also find that the diversity of the regulatory ecosystem has been increasing over the past two decades. These findings support the claim that regulatory activity and complexity are increasing, and this framework contributes an important step towards improving academic and policy discussions around legal complexity and regulation.


Michael J Bommarito II, michael.bommarito@gmail.com
1 Illinois Tech - Chicago Kent College of Law, Chicago, IL, USA
2 CodeX - The Stanford Center for Legal Informatics, Stanford, CA, USA

Introduction
Economies, like ecosystems, exhibit dynamic, complex behaviors resulting from the interaction of "organisms" inhabiting, altering, and being altered by their "environment." In the case of economies, organisms can be seen as companies, and environments can be seen, at least in part, as regulations. Just as changes in the environment like rising temperature can harm or help organisms, either broadly or for specific regions or organisms, so too can regulation harm or help companies. Yet unlike studies of biological ecosystems, studies of the economy have thus far lacked a longitudinal, empirical measure of fundamental quantities like "temperature" or "diversity." In this paper, we attempt to bridge this gap, finding support for the common claim that regulatory activity and complexity have increased over the last 20 years.
First, we develop a simple theoretical model of regulatory "ecosystems" that allows us to conceptualize regulatory energy and diversity. We then proceed to measure these quantities over time through an analysis of filings made by companies registered with the Securities and Exchange Commission (SEC) in the United States. These filings, the so-called "Form 10-Ks" that are made each year by companies registered under the Securities and Securities Exchange Acts of 1933 and 1934, provide a broad overview of company performance and risks. In particular, 10-Ks highlight regulatory risks and uncertainty that companies face, allowing for the documentation of longitudinal and comprehensive data related to the regulatory ecosystem. In total, we analyze more than 20 years, 30,000 companies, and 160,000 10-K reports to identify more than 4.5 million references to U.S. Federal regulatory Acts and Agencies. Using these references, we generate a reproducible, quantitative, and longitudinal measurement of the energy, "temperature," and diversity of the U.S. regulatory "ecosystem." We find a clear increase for all of these measures, with double- to triple-digit percentage growth over the last 20 years. We believe this framework and its ongoing application represent a principled approach to the quantification of regulatory ecosystems, and we hope that this research can drive better-grounded discussions of legal complexity and policy design in the modern world.

Data
10-K filings have been the focus of many academic studies in finance and accounting [1][2][3][4], law [5,6] and other adjacent fields [7][8][9]. Many of these studies have focused upon questions such as whether the reporting requirements are achieving their intended purpose and the extent to which markets react to the information disclosed in such filings. While these are certainly worthy questions in their own right, we believe these filings reveal important patterns that, at scale and over time, provide meaningful insight into a range of other scientific questions.
Companies that meet the requirements for 10-K reporting expend significant and increasing resources to prepare these documents, typically with the assistance of accounting firms and lawyers. Indeed, the Annual Audit Fee Survey [10] conducted by the Financial Executives Research Foundation reveals a mean and median 2015 expense of $1.8M and $522,205, respectively, across over 6000 filers. As required by law, the figures and statements contained within these reports are certified and attested to by both a company's officers and its independent auditors, and companies and their officers have strong incentives to report faithfully and comprehensively. At the same time, the competition for capital ensures that organizations do not over-report risks relative to industry peers, as this may frighten away investors. Therefore, compared with other sources of information, these 10-K annual reports are generally more likely to convey a comprehensive and balanced description of the environment in which a company operates.
Form 10-K filings generally contain at least four parts and fifteen schedules, which collectively offer a wealth of useful information about registered companies. These parts include a characterization of a company's financial health, legal risks, and other systematic and idiosyncratic factors, such as the nature of the regulatory environment in which it operates. Some of these factors, such as tax credits, may be positive, but the majority of listed regulatory factors are presented as risks. While there are a number of specific requirements under the law, firms and industries are provided with some latitude regarding how to satisfy reporting requirements. In addition, as explored in [1] and [5], there have been some important changes in the reporting rules over time. That said, SEC form templates and accounting firm standards result in more similarity than difference across firms and across time.
Through this exercise of risk factor disclosure, companies typically describe various sources of regulatory risk, including the laws and administrative agencies that are most relevant to their respective businesses. Consider, for example, the 2015 10-K filing of Trans Energy, Inc., an oil and gas exploration company. Among other statutory and agency references, their filing references both the Migratory Bird Treaty Act of 1918 and the Endangered Species Act of 1973:
The Endangered Species Act ("ESA") was established to protect endangered and threatened species. Pursuant to that act, if a species is listed as threatened or endangered, restrictions may be imposed on activities adversely affecting that species' habitat. Similar protections are offered to migratory birds under the Migratory Bird Treaty Act. The Company conducts operations on oil and natural gas leases that have species, such as raptors, that are listed and species, such as sage grouse, that could be listed as threatened or endangered under the ESA.
Across the broader set of required company disclosures, 10-K filings are filled with references such as these. Some, such as citations to Sections 13 and 15(d) of the Securities Exchange Act of 1934, are boilerplate and required by the SEC's forms. Others, such as the 756 references to the Migratory Bird Treaty Act of 1918, or the 27 references to the Price-Anderson Nuclear Industries Indemnity Act of 1957 over the last 23 years, are not.
In the mid-1990s, the SEC introduced its Electronic Data Gathering, Analysis, and Retrieval ("EDGAR") system. Since then, nearly all registered company 10-K reports have been uploaded and made available on EDGAR, resulting in more than 160,000 10-K reports accessible online. We retrieve these 10-K reports and build a multi-stage pipeline that identifies and normalizes references to Acts and Agencies. References are first identified through standard natural language processing techniques; once a reference fragment is identified, it is then passed through a second stage of normalization. As one example, many filers reference the "Gramm Leach Bliley Financial Services Modernization Act of 1999," but they rarely do so using its full name. Instead, they frequently refer to it as "GLB," "Graham Leach Bliley," "Gramm Leach Bliley," or the "Financial Services Modernization" Act. In order to handle this variation, we built a mapping for over 600 potential Act references, relying on a combination of the US Code, Wikipedia, and manual review. This mapping is then combined with fuzzy-string matching techniques to correct for spelling mistakes such as "Graham Leach Bliley." The result is a high-precision and high-recall extraction of 401 unique Federal Acts and 133 Agencies across our 23-year dataset. In total, we identify more than 4.5 million Act and Agency references contained in 10-K reports over the past 23 years.

[...] may contain or interact with resources or organisms like timber, infrared radiation, or cattle. These cells may also provide an area over which measures like averages or sums may be calculated. For example, since the 1970s, scientists have been using satellite instrumentation and representations of the earth's surface to model systems related to surface temperature, rainfall, land use, and soil hydrology [13,14].
We follow this spatial systems approach to model the regulatory environment. In the general case with M companies and N regulations, we can represent the regulatory ecosystem as M vectors in an N-dimensional space. We refer to each of these M vectors as a company or filing "profile." These profiles can encode only the binary presence of regulatory exposure: a 0 or 1 in each dimension for each regulation. The profiles can also encode the number of references in that regulatory dimension: 0 or more for each regulation. At present, the regulatory space consists of N = 401 dimensions, one for each Act currently identified in our data.
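As a rough illustration of how raw reference fragments might become profile vectors, the following Python sketch normalizes a handful of fragments to canonical Act names and then builds both the count profile and the binary profile. The three-Act universe, the `ALIASES` table, the 0.85 similarity cutoff, and the use of `difflib` are all assumptions made for this example; they stand in for the mapping and fuzzy-string matching described in the Data section and are not the pipeline actually used.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical three-Act universe; the space described above has N = 401 Acts.
ACTS = [
    "Endangered Species Act of 1973",
    "Gramm Leach Bliley Financial Services Modernization Act of 1999",
    "Migratory Bird Treaty Act of 1918",
]

# Hypothetical alias table: lower-cased surface forms -> canonical Act name.
ALIASES = {
    "endangered species act": ACTS[0],
    "esa": ACTS[0],
    "gramm leach bliley": ACTS[1],
    "glb": ACTS[1],
    "financial services modernization": ACTS[1],
    "migratory bird treaty act": ACTS[2],
}


def normalize(fragment, cutoff=0.85):
    """Map a raw reference fragment to a canonical Act name, or None."""
    key = fragment.lower().strip()
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to approximate matching to absorb misspellings.
    close = get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


def profiles(fragments):
    """Return (p, b): reference counts and presence bits, one entry per Act."""
    counts = Counter(filter(None, map(normalize, fragments)))
    p = [counts[act] for act in ACTS]
    b = [1 if n > 0 else 0 for n in p]
    return p, b


refs = ["ESA", "Graham Leach Bliley", "GLB", "Endangered Species Act", "Clean Air Act"]
p, b = profiles(refs)
print(p, b)
```

In this toy run the misspelled "Graham Leach Bliley" is folded into the same dimension as "GLB," while "Clean Air Act" is dropped because it is outside the illustrative universe.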
Initially, in the ground state where no regulation exists, these M company vectors are all equal to the zero vector. As institutions "perform work" by expending "regulatory energy" to regulate companies, however, one or more of these M company vectors move, i.e., their profile vectors increase away from the zero vector. In the simplest case, where institutions regulate only one dimension i and all companies are homogeneous, then all M vectors move along dimension i proportional to regulatory work. This change in position away from the ground state in regulatory space captures an increase in total energy in the system. Subsequently, institutions may continue to perform work to regulate or de-regulate companies, resulting in increased or decreased energy along that dimension i.
In reality, institutions may simultaneously regulate and de-regulate along many dimensions. Furthermore, regulatory energy may not translate to equal motion in all cases. In some cases, non-conservative forces like lobbying "frictions" may reduce motion. In other cases, companies may exhibit different "mass" or other heterogeneity, translating equal regulatory energy into unequal regulatory "motion." Companies may also mutate, grow, or die over time, changing both the behavior of institutions and other companies as well as their own exposure to regulatory forces. In any case, the result is that companies may occupy varying positions in regulatory space over time. Nevertheless, by measuring the total change in position from the ground state, we obtain a measure of total energy in the regulatory system at a given time.
In the simplest case where institutions regulate homogeneous companies and all companies experience regulation equally, all M companies occupy the same position in regulatory space. However, as these assumptions are relaxed, then the positions of companies may vary across regulatory space. Measuring the "diversity" or degree of variation between companies allows us to understand how far away from the homogeneous case a regulatory system is. Just as in ecology and physics, the degree of diversity between companies has important implications for the efficiency and fragility of systems.

Methodology and Results
To formalize these concepts with notation, let a 10-K profile $p(a) \in \mathbb{Z}^N$ have element $p_i(a)$ equal to the number of references to Act $i$ in company $a$'s annual filing. This vector $p$ can then be normalized or projected. For example, we can project $p$ from the number of references to a "bitstring" vector $b(a)$, whose element $b_i(a)$ is equal to 1 if company $a$'s filing mentions Act $i$ at least once, else 0. Alternatively, we can aggregate or normalize these filings by viewing them as a time-indexed collection. Let $F(t)$ be the matrix whose rows are the profiles $p(a)$ of the filings made in year $t$. It is possible to normalize the number of references per Act to a rate of reference per filing by dividing the column sums of $F(t)$ by the number of rows $m(t)$, which we call $r_j(t)$ for the $j$th Act in year $t$.
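The notation above can be made concrete with a small example. The sketch below builds a toy matrix for one year, computes the per-Act reference rates from its column sums, and derives the bitstring projection. The matrix values and the variable names `F_t`, `m_t`, `r_t`, and `B_t` are illustrative assumptions, not data from the filings.

```python
# Toy year with m(t) = 3 filings over N = 4 hypothetical Acts.
# Each row is one filing's profile p(a): reference counts per Act.
F_t = [
    [5, 0, 2, 0],
    [3, 1, 0, 0],
    [4, 0, 0, 2],
]

m_t = len(F_t)  # number of filings (rows) in year t

# Column sums: total references to each Act across all filings.
column_sums = [sum(col) for col in zip(*F_t)]

# r_j(t): average number of references to Act j per filing.
r_t = [s / m_t for s in column_sums]

# Bitstring projection b(a): 1 if the Act is mentioned at least once, else 0.
B_t = [[1 if count > 0 else 0 for count in row] for row in F_t]

print(column_sums)  # [12, 1, 2, 2]
print(r_t)
print(B_t)
```

Here the first Act behaves like a boilerplate reference that every filing makes, while the remaining Acts each appear in only one filing.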
We then use these representations to measure the total energy, temperature, and diversity of the regulatory ecosystem as follows. First, we measure the total energy of the regulatory ecosystem using p-vectors as:
$$E(t) = \sum_{a} \sum_{i=1}^{N} p_i(a),$$
where the outer sum runs over all filings $a$ made in year $t$. Figure 1 shows that the total energy, as measured by number of references to Acts per year, has increased substantially in the last 23 years. In 1996, there were just over 40,000 references to Acts in the nearly 5000 filings that year; by 2006, these numbers had more than quadrupled to nearly 200,000 references in just over 9000 filings; and, through three quarters of 2016, these numbers have again increased to an annualized rate of over 300,000 references.
Total energy alone can be misleading with respect to policy interpretations, however, as there are a number of reasons why energy may change without relation to regulatory exposure or "burden." For example, (i) the economy may grow or shrink in real or nominal terms, increasing or decreasing the total number of companies or companies meeting registration requirements, (ii) the SEC rules governing registration or filings may change, increasing or decreasing the number of companies or references, or (iii) ceteris paribus, the relative incentives to incorporate or take on shareholders may change, increasing or decreasing the number of companies registered. These factors do not necessarily imply more or less regulation as experienced by individual companies, although they may be viewed as endogenous to some policy questions.
We can control for these factors by normalizing total energy to a "temperature," taking into account the number of filings per year as an analogy to area or volume. While this conception simplifies the traditional distinction between types of energy, we select it for its ecological analogy. To calculate, we take the average rate of references per filing, "temperature," $T(t)$ as follows:
$$T(t) = \frac{1}{m(t)} \sum_{a} \sum_{i=1}^{N} p_i(a) = \sum_{j=1}^{N} r_j(t),$$
where the sum over $a$ runs over the $m(t)$ filings made in year $t$. Figure 2 shows that, over the last 23 years, $T(t)$ has been monotonically increasing. While the rate of reference, like the total energy in Fig. 1 above, clearly shows the effect of the Sarbanes-Oxley changes in 2003, this trend remains unbroken both well before and well after. In 1996, the average number of references per filing was 8.4; by 2006, it had more than doubled to 20.9; and by 2016, the rate had increased again by more than 50% to 31.7 Act references per filing. Even if the amount of energy or cost does not scale linearly per filing with the number of references, the monotonic, 237% increase in $T(t)$ clearly demonstrates an increasing regulatory temperature. Table 1 below summarizes the data from Figs. 1 and 2 above.
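The distinction between total energy and temperature can be illustrated with a short Python sketch. The function names `total_energy` and `temperature`, and the toy profiles in `filings_by_year`, are assumptions made for this example; the numbers are invented and are not the values behind Figs. 1 and 2.

```python
def total_energy(profiles):
    """Total number of Act references across all filings in one year."""
    return sum(sum(p) for p in profiles)


def temperature(profiles):
    """Average number of Act references per filing in one year."""
    return total_energy(profiles) / len(profiles)


# Hypothetical profiles over N = 3 Acts; the later year has more filings.
filings_by_year = {
    1996: [[4, 0, 0], [6, 2, 0]],
    2006: [[8, 2, 0], [9, 3, 2], [10, 0, 2], [12, 0, 0]],
}

for year, profs in sorted(filings_by_year.items()):
    print(year, total_energy(profs), temperature(profs))
```

In this toy data the total energy quadruples (12 to 48) while the temperature only doubles (6.0 to 12.0), because half of the growth in references comes from the larger number of filings rather than from heavier exposure per filing.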
Finally, we may ask: is temperature or energy changing in concert with diversity, or is the change in temperature concentrated along a single dimension of regulation? For example, changes in energy or temperature can represent more or less reliance on the same Act, e.g., the Securities Exchange Act of 1934; in this case, the number of unique Acts referenced is not changing, but the regulatory exposure per Act is. Alternatively, the total number of unique Acts referenced could be increasing or decreasing; for example, in 2003, most registered companies added references to the recently enacted Sarbanes-Oxley Act, which had not previously been referenced. Changes such as these represent an increase or decrease in the number of dimensions or diversity of regulatory exposure, but not necessarily the intensity of each exposure.
Fig. 3 Average number of unique Acts and Agencies per filing over time
Using our notation above, we evaluate the diversity question by calculating two measures. First, we calculate the number of unique Acts per filing through the sum of b vectors above. Then, we calculate the average number of unique Acts per filing, across all companies in a given year; this is
$$\frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{N} b_i^j,$$
where $b_i^j$ is the bit corresponding to whether the $i$th Act was referenced in the $j$th company filing and $m$ is the number of companies per year. Figure 3 shows that, over the last 23 years, the diversity of Acts referenced has increased jointly with temperature. Like Figs. 1 and 2 above, the time series exhibits a jump following Sarbanes-Oxley; however, like Fig. 2, the time series also exhibits a monotonic increase over two decades, growing from 3.1 unique Acts per filing in 1996 to 5.6 unique Acts per filing in 2006 to 7.9 Acts per filing in 2016. This increase suggests that the increase in regulatory ecosystem temperature has been, at least in part, related to an increase in the number of dimensions along which institutions are regulating.
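To see how this first diversity measure separates intensity from breadth, consider the following minimal sketch. The helper names `unique_acts` and `mean_unique_acts` and both toy years are hypothetical; they are chosen so that the two years have identical temperature but different diversity.

```python
def unique_acts(profile):
    """Number of distinct Acts mentioned at least once in one filing."""
    return sum(1 for count in profile if count > 0)


def mean_unique_acts(profiles):
    """Average number of distinct Acts per filing across one year."""
    return sum(unique_acts(p) for p in profiles) / len(profiles)


# Two hypothetical years with the same temperature (10 references per filing).
concentrated = [[10, 0, 0, 0], [10, 0, 0, 0]]
spread_out = [[4, 3, 3, 0], [5, 0, 3, 2]]

print(mean_unique_acts(concentrated))  # 1.0
print(mean_unique_acts(spread_out))    # 3.0
```

Both toy years carry ten references per filing, yet the second spreads them over three Acts per filing instead of one, which is the kind of change this measure is meant to detect.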
As an additional measure of diversity, we analyze each company's yearly regulatory "bitstring." As noted earlier, we calculate the 401-bit vector b for each company-year, where each bit corresponds to the presence of one of the 401 discrete Acts we identify. Although the regulatory space has 401 dimensions, the bitstring for a given filing is likely to be extremely sparse. For example, consider the 2012 10-K filed by the Boeing Company. Their filing features a bitstring with 12 non-zero elements, including Acts such as the Homeland Security Act, the Employee Retirement Income Security Act, the Patient Protection And Affordable Care Act and the American Taxpayer Relief Act. Similarly, the 2014 10-K of Facebook Inc. features 10 non-zero elements, including the Bank Secrecy Act, the U.S. Foreign Corrupt Practices Act, the USA Patriot Act, and the Credit CARD Act.
After applying this formalization to all companies for all years, we calculate the average pairwise Hamming distance [15] between all company bitstrings in a given year. Hamming distance is commonly used to evaluate the diversity of genomic [16][17][18] and other related data [19][20][21]. It can be interpreted as proportional to the average number of regulatory dimensions not in common between companies. More explicitly, the Hamming distance between two companies $a$ and $b$ in year $t$ is:
$$D_{a,b}(t) = \sum_{i=1}^{N} b_i(a) \oplus b_i(b),$$
where $\oplus$ is the element-wise XOR operator and $b_i(\cdot)$ denotes the $i$th bit of a company's bitstring. We can then write the average Hamming distance $\bar{d}(t)$ as the average over all combinations of $a$ and $b$ at $t$.
Figure 4 visualizes the structure of the distance matrix $D$ for all $a$, $b$ as of 1994. The large block in the lower right corresponds primarily to special purpose vehicles like trusts or limited partnerships, and the overall structure corresponds to sectors and industries. Figure 5, which plots the average Hamming distance $\bar{d}(t)$ over time for Acts and Agencies, portrays the mean-field distance between firms at scale, confirming an increasing diversity across the global regulatory ecosystem. Over time, companies are subject, on average, to increasingly different requirements. While not monotonically increasing like the rate of reference and number of unique references above, the average distance increases in 18 of the 23 years in the sample. In 1996, two firms were separated on average by fewer than four regulatory Act "bits" or "genes"; by 2016, this number has increased to nearly 10. Table 2 summarizes the data from Figs. 3 and 4 above.
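A minimal Python sketch of the average pairwise Hamming distance follows. The function names and the three toy bitstrings are assumptions for illustration. The second function, `mean_pairwise_hamming_fast`, is an implementation shortcut that is not described in the text: it counts, for each Act, how many pairs of filings disagree on that bit, which avoids enumerating all pairs when a year contains thousands of filings.

```python
from itertools import combinations


def hamming(x, y):
    """Number of positions at which two equal-length bitstrings differ."""
    return sum(xi ^ yi for xi, yi in zip(x, y))


def mean_pairwise_hamming(bitstrings):
    """Average Hamming distance over all unordered pairs of filings."""
    pairs = list(combinations(bitstrings, 2))
    return sum(hamming(x, y) for x, y in pairs) / len(pairs)


def mean_pairwise_hamming_fast(bitstrings):
    """Same average, computed per Act instead of per pair."""
    m = len(bitstrings)
    # If k filings mention an Act, then k * (m - k) pairs disagree on its bit.
    differing = sum(k * (m - k) for k in map(sum, zip(*bitstrings)))
    return differing / (m * (m - 1) / 2)


# Hypothetical bitstrings for three filings over N = 5 Acts.
bits = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1],
]

print(mean_pairwise_hamming(bits))
print(mean_pairwise_hamming_fast(bits))
```

The Act shared by all three filings contributes nothing to the distance, so a universally cited Act raises temperature without raising this diversity measure.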

Conclusion and Future Work
In this paper, we have presented the first large-scale, longitudinal characterization of the energy, "temperature," and diversity of the regulatory ecosystem as characterized by our spatial model. We have identified increasing regulatory exposure along an increasing number of dimensions, providing evidence in support of the claim that regulatory burden is increasing. Using a bitstring representation of firm regulatory exposure, we have confirmed that the aggregate Federal regulatory ecosystem is becoming more diverse over time, providing evidence in support of the claim that regulatory complexity is increasing. These conclusions are based on more than 20 years, 30,000 companies, 160,000 10-K reports, and 4.5 million references contained in uniquely comprehensive and accurate 10-K reports.
In future work, we intend to expand upon these questions and connect to an extant research agenda, including the development of a more detailed theoretical framework, the modeling of agent-based or computational regulatory ecosystems, the categorization of regulatory "species" and "climates," and the integration of this analysis with our existing work on the complexity of other statutory, regulatory, and judicial systems [22,23].
Our work contributes to both the broader literature on legal complexity [24][25][26][27] and efforts to document the physical properties of legal systems as complex adaptive systems [28][29][30][31][32][33]. In addition, this paper is among a growing set of recent works applying tools of machine learning and natural language processing to better understand the behavior of various legal systems [34][35][36].
In sum, we believe that this framework for modeling and measurement will contribute to ongoing academic and policy discussions around legal complexity and policy design. The continued development of both global and specialized regulatory indices can provide for a principled, empirical basis of evaluation, standing in stark contrast to the vague generalizations that frequently guide current policy decisions.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.