RegData 2.2: a panel dataset on US federal regulations

Abstract

How much regulation exists? Can short- and long-term growth trends in regulation be identified? Which agencies produce the most regulation? Are some sectors of the economy more regulated than others, and how big are the differences? RegData 2.2, a recent panel dataset from the RegData Project at George Mason University’s Mercatus Center, offers answers to these questions and more. RegData 2.2 quantifies various aspects of US federal regulations by industry, by agency, and over time. The resulting datasets include metrics on volumes, restrictiveness, and relevance of federal regulations to different economic sectors and industries. RegData datasets are publicly released at http://quantgov.org. We explain the features of and methodology underlying RegData 2.2.

Figs. 1–3

Notes

  1.

    For more on the QuantGov Project, visit http://quantgov.org.

  2.

    Earlier versions of RegData also mapped regulations to NAICS-defined industries, but they used a human-assisted algorithm to achieve the mapping, rather than machine learning algorithms. The human-assisted algorithm used in the first two versions of RegData (1 and 2.0) is explained in great detail in Al-Ubaydli and McLaughlin (2015).

  3.

    If classifications for a given industry are not sufficiently reliable, that industry is included only in a supplemental, unfiltered dataset. For some industries, it is not possible to produce classifications at all because of the small number of example documents. See Sect. 3.3 below.

  4.

    For RegData 2.2, our error detection and smoothing process proceeded in two main steps. First, if a section-level number could not be parsed because of OCR errors, the appropriate section number was inferred from a rolling-plurality vote over the previous 10 sections. That approach was taken to localize errors to a single section rather than an entire part and to ensure that one-off errors were not carried forward to other sections.

    Second, after the initial parsing, the file size for each part was analyzed for every year it was present in the CFR. Parts present in a single year, but not the year before or the year after, were dropped. Parts missing in a single year, but present in the previous year or the following year, were filled in using the text of the part from the preceding year. Because part size generally follows a smooth trend, we also corrected for outlier discontinuities: if a part’s file size was not within 15% of either the previous or the following year’s, the text from the previous year was used.
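The outlier-smoothing rule in the second step can be sketched as follows. This is a minimal illustration, not the RegData implementation: the function name and the year-to-size mapping are hypothetical, and only the 15% discontinuity rule is shown.

```python
def smooth_part_sizes(sizes):
    """Illustrative sketch of the 15% outlier-smoothing rule.

    `sizes` maps year -> file size for a single CFR part. If a year's
    size is not within 15% of either the previous or the following
    year's size, the previous year's part (here represented by its
    size) is carried forward instead.
    """
    years = sorted(sizes)
    smoothed = dict(sizes)
    # Interior years only: the first and last years have no two-sided check.
    for prev_y, y, next_y in zip(years, years[1:], years[2:]):
        prev, cur, nxt = smoothed[prev_y], sizes[y], sizes[next_y]
        within_prev = abs(cur - prev) <= 0.15 * prev
        within_next = abs(cur - nxt) <= 0.15 * nxt
        if not (within_prev or within_next):
            smoothed[y] = smoothed[prev_y]  # use the previous year's part
    return smoothed

# A one-year spike in 1998 is treated as an outlier and replaced.
print(smooth_part_sizes({1997: 100, 1998: 300, 1999: 105, 2000: 110}))
# -> {1997: 100, 1998: 100, 1999: 105, 2000: 110}
```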

  5.

    Subsequent versions of RegData have added additional years of coverage. RegData 3.0 spans 1970–2016, while 3.1 covers 1970–2017. These datasets also are available at http://quantgov.org/data.

  6.

    Scikit-learn is an open-source set of machine learning tools and algorithms for the Python programming language, available at http://scikit-learn.org/stable/.

  7.

    Lemmatization refers to an algorithmic process, common in computational linguistics, in which a computer program identifies a word’s “lemma,” or dictionary form. For example, the word “environment” is the lemma for the adjective “environmental.” Lemmatization lets occurrences of different inflected forms of the same lemma (such as “environmental” in the example above) be analyzed as a single category or item. The WordNet lemmatizer is an open-source Python tool that performs lemmatization; it is available as part of the Natural Language Toolkit (NLTK) package at https://www.nltk.org/install.html.
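A toy illustration of what lemmatization accomplishes: inflected or derived forms are mapped to a shared dictionary form so they can be counted as one item. The lookup table below is a small hand-built stand-in for a real lemmatizer such as NLTK’s WordNet lemmatizer; its entries are illustrative only.

```python
# Hypothetical lemma lookup table; a real lemmatizer derives these
# mappings from a lexical database rather than a hard-coded dict.
LEMMAS = {
    "environmental": "environment",
    "environments": "environment",
    "regulations": "regulation",
}

def lemmatize(token):
    """Return the lemma for a token, or the token itself if unknown."""
    return LEMMAS.get(token.lower(), token.lower())

tokens = ["Environmental", "regulations", "shape", "environments"]
print([lemmatize(t) for t in tokens])
# -> ['environment', 'regulation', 'shape', 'environment']
```

After lemmatization, “environmental” and “environments” contribute to a single count for “environment,” which is what allows different surface forms to be analyzed as one category.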

  8.

    Precision is calculated as TP/(TP + FP), where TP is true positives and FP is false positives. Recall is calculated as TP/(TP + FN), where FN is false negatives. In both cases, the highest possible score along either dimension is one, so the F1 score, the harmonic mean of the two, also has a maximum possible value of one. Maximizing either dimension alone is not necessarily desirable, however, because there is usually a tradeoff between the two: a model can achieve very high precision, for example, at the cost of many false negatives (low recall). F1 scores are useful for comparing models within a given classification project while balancing those two dimensions. However, the machine learning community typically cautions against using F1 scores to compare one project to another, because precision and recall may be valued differently in different projects.
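The formulas above can be illustrated with a short calculation; the counts are hypothetical and chosen to show the precision–recall tradeoff.

```python
def f1_score(tp, fp, fn):
    """Compute precision, recall, and the F1 score (their harmonic mean)
    from counts of true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A conservative classifier: few false positives (high precision) but
# many false negatives (low recall) yields only a middling F1 score.
p, r, f1 = f1_score(tp=40, fp=2, fn=60)
print(round(p, 3), round(r, 3), round(f1, 3))
# -> 0.952 0.4 0.563
```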

  9.

    Several of the articles in this special issue use RegData 2.2, including Bailey et al. (2018), Chambers et al. (2018a, b, c), Manish and O’Reilly (2018) and Mulholland (2018). Here is a short and by no means comprehensive list of other journal articles: Ellig and McLaughlin (2016), Bailey and Thomas (2017), Goldschlag and Tabarrok (2018) and Pizzola (2018). A more comprehensive list, including dozens of working papers, is available at: http://quantgov.org/research.

References

  1. Al-Ubaydli, O., & McLaughlin, P. A. (2015). RegData: A numerical database on industry-specific regulations for all United States industries and federal regulations, 1997–2012. Regulation & Governance, 11(1), 109–123.

  2. Bailey, J. B., & Thomas, D. W. (2017). Regulating away competition: The effect of regulation on entrepreneurship and employment. Journal of Regulatory Economics, 52(3), 237–254.

  3. Bailey, J. B., Thomas, D. W., & Anderson, J. R. (2018). Regressive effects of regulation on wages. Public Choice. https://doi.org/10.1007/s11127-018-0517-5.

  4. Chambers, D., Collins, C. A., & Krause, A. (2018a). How do federal regulations affect consumer prices? An analysis of the regressive effects of regulation. Public Choice. https://doi.org/10.1007/s11127-017-0479-z.

  5. Chambers, D., McLaughlin, P. A., & Stanley, L. (2018b). Barriers to prosperity: The harmful impact of entry regulations on income inequality. Public Choice. https://doi.org/10.1007/s11127-018-0498-4.

  6. Chambers, D., McLaughlin, P. A., & Stanley, L. (2018c). Regulation and poverty. Public Choice, this issue.

  7. Coffey, B., McLaughlin, P. A., & Tollison, R. D. (2012). Regulators and redskins. Public Choice, 153, 191–204.

  8. Coglianese, C. (2002). Empirical analysis and administrative law. University of Illinois Law Review, 4, 1111–1138.

  9. Dawson, J. W., & Seater, J. J. (2013). Federal regulation and aggregate economic growth. Journal of Economic Growth, 18(2), 137–177.

  10. Ellig, J., & McLaughlin, P. A. (2016). The regulatory determinants of railroad safety. Review of Industrial Organization, 49(2), 371–398.

  11. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.

  12. Goldschlag, N., & Tabarrok, A. (2018). Is regulation to blame for the decline in American entrepreneurship? Economic Policy, 33(93), 5–44.

  13. Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3), 299–310.

  14. Manish, G. P., & O’Reilly, C. (2018). Banking regulation, regulatory capture, and inequality. Public Choice. https://doi.org/10.1007/s11127-018-0501-0.

  15. Mulholland, S. E. (2018). Stratification by regulation. Public Choice, this issue. https://doi.org/10.1007/s11127-018-0597-2.

  16. Mulligan, C., & Shleifer, A. (2005). The extent of the market and the supply of regulation. Quarterly Journal of Economics, 120, 1445–1473.

  17. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

  18. Pizzola, B. (2018). Business regulation and business investment: Evidence from US manufacturing 1970–2009. Journal of Regulatory Economics, 53(3), 243–255.

Author information

Corresponding author

Correspondence to Patrick A. McLaughlin.

Cite this article

McLaughlin, P.A., Sherouse, O. RegData 2.2: a panel dataset on US federal regulations. Public Choice 180, 43–55 (2019). https://doi.org/10.1007/s11127-018-0600-y

Keywords

  • RegData
  • Regulation
  • Policy analytics
  • QuantGov
  • Machine learning