How much regulation exists? Can short- and long-term growth trends in regulation be identified? Which agencies produce the most regulation? Are some sectors of the economy more regulated than others, and how big are the differences? RegData 2.2, a recent panel dataset from the RegData Project at George Mason University’s Mercatus Center, offers answers to these questions and more. RegData 2.2 quantifies various aspects of US federal regulations by industry, by agency, and over time. The resulting datasets include metrics on volumes, restrictiveness, and relevance of federal regulations to different economic sectors and industries. RegData datasets are publicly released at http://quantgov.org. We explain the features of and methodology underlying RegData 2.2.
For more on the QuantGov Project, visit http://quantgov.org.
Earlier versions of RegData also mapped regulations to NAICS-defined industries, but they used a human-assisted algorithm to achieve the mapping, rather than machine learning algorithms. The human-assisted algorithm used in the first two versions of RegData (1 and 2.0) is explained in great detail in Al-Ubaydli and McLaughlin (2015).
If classifications for a given industry are not sufficiently reliable, that industry is included only in a supplemental, unfiltered dataset. For some industries, it is not possible to produce classifications at all because of the small number of example documents. See Sect. 3.3 below.
For RegData 2.2, our error detection and smoothing process proceeded in two main steps. First, if a section-level number could not be parsed because of OCR errors, the appropriate number was inferred from a rolling-plurality vote over the previous 10 sections. That approach was taken to localize errors to a single section rather than an entire part and to ensure that one-off errors were not carried forward to other sections.
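The rolling-plurality rule can be sketched as follows; the function name and data layout are illustrative assumptions, not the authors' actual code:

```python
from collections import Counter

def infer_section_number(parsed_numbers, window=10):
    """Infer a section number that OCR errors made unparseable.

    parsed_numbers: numbers parsed for the preceding sections, with
    None where parsing failed. The most common value among the last
    `window` successfully parsed sections wins the plurality vote.
    """
    recent = [n for n in parsed_numbers[-window:] if n is not None]
    if not recent:
        return None
    # Plurality vote over the trailing window of parsed sections.
    return Counter(recent).most_common(1)[0][0]
```

Because the vote looks back only over a short window, a single garbled section number cannot propagate an error into later sections.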
Second, after the initial parsing, the file size for each part was analyzed for every year it was present in the CFR. Parts present in a single year, but not the year before or the year after, were dropped. Parts missing in a single year, but present the previous year or the following year, were filled in using the text from the part in the preceding year. Because part size generally follows a smooth trend, we also corrected for outlier discontinuities. If a part’s file size was not within 15% of either the previous or following year, the text from the previous year was used.
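The three smoothing rules can be sketched together; the function name and the year-to-size mapping are hypothetical, and the sketch only flags which years to drop or to overwrite with the prior year's text:

```python
def smooth_part_history(sizes):
    """Apply the year-over-year rules described above to one CFR part.

    sizes: dict mapping year -> file size, with None when the part is
    absent that year. Returns (drop, fill_from_prev): years whose part
    should be dropped, and years whose text should be replaced with
    the previous year's text.
    """
    years = sorted(sizes)
    drop, fill_from_prev = set(), set()
    for i, yr in enumerate(years[1:-1], start=1):
        prev, cur, nxt = sizes[years[i - 1]], sizes[yr], sizes[years[i + 1]]
        if cur is not None and prev is None and nxt is None:
            drop.add(yr)              # present only in this one year
        elif cur is None and prev is not None:
            fill_from_prev.add(yr)    # one-year gap: carry prior text forward
        elif None not in (prev, cur, nxt):
            # Outlier: not within 15% of either neighboring year.
            if abs(cur - prev) > 0.15 * prev and abs(cur - nxt) > 0.15 * nxt:
                fill_from_prev.add(yr)
    return drop, fill_from_prev
```

A part whose size jumps from 100 to 300 and back to 105, for example, would have its middle year replaced by the preceding year's text under the 15% rule.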
Subsequent versions of RegData have added additional years of coverage. RegData 3.0 spans 1970–2016, while 3.1 covers 1970–2017. These datasets also are available at http://quantgov.org/data.
Scikit-learn is an open-source set of machine learning tools and algorithms for the Python programming language, available at http://scikit-learn.org/stable/.
Lemmatization refers to an algorithmic process, common in computational linguistics, by which a computer program identifies a word’s “lemma,” or dictionary form. For example, the word “environment” is the lemma for the adjective “environmental.” Lemmatization lets occurrences of different inflected forms of the same lemma (such as “environmental” in the example above) be analyzed as a single category or item. WordNet is a lexical database of English; an open-source lemmatizer built on WordNet is available as part of the Natural Language Toolkit (NLTK) package at https://www.nltk.org/install.html.
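The idea of collapsing inflected forms to a shared lemma can be illustrated with a toy sketch. The hand-built lookup table below is purely hypothetical; the actual pipeline uses NLTK's WordNet-based lemmatizer rather than anything like this:

```python
from collections import Counter

# Hypothetical, hand-built lemma map for illustration only; a real
# lemmatizer derives these mappings from WordNet, not a lookup table.
LEMMAS = {
    "environmental": "environment",
    "environments": "environment",
    "regulations": "regulation",
    "regulatory": "regulation",
}

def lemma_counts(tokens):
    """Count token occurrences per lemma, so that different inflected
    forms of the same word are tallied as a single item."""
    return Counter(LEMMAS.get(t.lower(), t.lower()) for t in tokens)
```

After this step, “environment” and “environmental” contribute to one count rather than two, which is what makes the downstream document classifiers less sensitive to surface word forms.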
Precision is calculated as TP/(TP + FP), where TP is true positives and FP is false positives. Recall is calculated as TP/(TP + FN), where FN is false negatives. In both cases, the highest possible score for a model along the single dimension equals one. The F1 score, the harmonic mean of precision and recall, therefore also has a maximum possible value of one; maximizing one dimension alone, however, is not necessarily desirable, because there is usually a tradeoff between the two. A model can achieve very high precision at the cost of many false negatives, and thus low recall. F1 scores are useful for comparing models within a given classification project while balancing between those two dimensions. However, the machine learning community typically cautions against comparing one project to another by using F1 scores because precision and recall may be valued in different ways in different projects.
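The three metrics follow directly from the counts of true positives, false positives, and false negatives; a minimal implementation (function name is our own) is:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    F1 is the harmonic mean of precision and recall, so it is high
    only when both dimensions are high.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A classifier with tp=8, fp=2, fn=8 has precision 0.8 but recall only 0.5, and the harmonic mean pulls F1 down toward the weaker dimension, which is exactly the balancing behavior described above.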
Several of the articles in this special issue use RegData 2.2, including Bailey et al. (2018), Chambers et al. (2018a, b, c), Manish and O’Reilly (2018) and Mulholland (2018). Here is a short and by no means comprehensive list of other journal articles: Ellig and McLaughlin (2016), Bailey and Thomas (2017), Goldschlag and Tabarrok (2018) and Pizzola (2018). A more comprehensive list, including dozens of working papers, is available at: http://quantgov.org/research.
Al-Ubaydli, O., & McLaughlin, P. A. (2015). RegData: A numerical database on industry-specific regulations for all United States industries and federal regulations, 1997–2012. Regulation & Governance, 11(1), 109–123.
Bailey, J. B., & Thomas, D. W. (2017). Regulating away competition: The effect of regulation on entrepreneurship and employment. Journal of Regulatory Economics, 52(3), 237–254.
Bailey, J. B., Thomas, D. W., & Anderson, J. R. (2018). Regressive effects of regulation on wages. Public Choice. https://doi.org/10.1007/s11127-018-0517-5.
Chambers, D., Collins, C. A., & Krause, A. (2018a). How do federal regulations affect consumer prices? An analysis of the regressive effects of regulation. Public Choice. https://doi.org/10.1007/s11127-017-0479-z.
Chambers, D., McLaughlin, P. A., & Stanley, L. (2018b). Barriers to prosperity: The harmful impact of entry regulations on income inequality. Public Choice. https://doi.org/10.1007/s11127-018-0498-4.
Chambers, D., McLaughlin, P. A., & Stanley, L. (2018c). Regulation and poverty. Public Choice, this issue.
Coffey, B., McLaughlin, P. A., & Tollison, R. D. (2012). Regulators and redskins. Public Choice, 153, 191–204.
Coglianese, C. (2002). Empirical analysis and administrative law. University of Illinois Law Review, 4, 1111–1138.
Dawson, J. W., & Seater, J. J. (2013). Federal regulation and aggregate economic growth. Journal of Economic Growth, 18(2), 137–177.
Ellig, J., & McLaughlin, P. A. (2016). The regulatory determinants of railroad safety. Review of Industrial Organization, 49(2), 371–398.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Goldschlag, N., & Tabarrok, A. (2018). Is regulation to blame for the decline in American entrepreneurship? Economic Policy, 33(93), 5–44.
Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3), 299–310.
Manish, G. P., & O’Reilly, C. (2018). Banking regulation, regulatory capture, and inequality. Public Choice. https://doi.org/10.1007/s11127-018-0501-0.
Mulholland, S. E. (2018). Stratification by regulation. Public Choice, this issue. https://doi.org/10.1007/s11127-018-0597-2.
Mulligan, C., & Shleifer, A. (2005). The extent of the market and the supply of regulation. Quarterly Journal of Economics, 120, 1445–1473.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pizzola, B. (2018). Business regulation and business investment: Evidence from US manufacturing 1970–2009. Journal of Regulatory Economics, 53(3), 243–255.
McLaughlin, P.A., Sherouse, O. RegData 2.2: a panel dataset on US federal regulations. Public Choice 180, 43–55 (2019). https://doi.org/10.1007/s11127-018-0600-y
- Policy analytics
- Machine learning