Getting to Know Thy Data

Titus, Marvin

doi:10.1007/978-3-030-60831-6_5

Part of the book series: Quantitative Methods in the Humanities and Social Sciences (QMHSS)

Abstract

This chapter discusses and demonstrates the importance of getting to know the data that we use to conduct higher education policy analysis and evaluation. More specifically, this chapter addresses the need to know the structure of datasets. The identification and exploration of missing data are also discussed in this chapter.

Notes

1. For a complete description of the storage types, see page 89 of the Stata User’s Guide Release 16.
2. For more information on compress, see pages 77–78 of the Stata User’s Guide Release 16.
3. In Stata 16, the maximum number of variables is 2,048 for Stata/IC and 120,000 for Stata/MP. In this example, we are using Stata/SE, which allows a maximum of 32,767 variables. The corresponding limits on the number of right-hand-side variables in a model are 798, 65,532, and 10,998.
4. These data can be found at: https://sheeo.org/project/state-higher-education-finance/.
5. The egen command, which is short for extensions to generate, can be employed to create variables that also require an additional function. For a detailed explanation of the egen command, see pages 203–223 of the Stata User’s Guide Release 16.
6. For the full documentation for missings, see Cox (2015).

References

Cox, N. J. (2015). Speaking Stata: A set of utilities for managing missing values. The Stata Journal, 15(4), 1174–1185.
Li, C. (2013). Little’s test of missing completely at random. The Stata Journal, 13(4), 795–809. https://doi.org/10.1177/1536867X1301300407
Little, R. J. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198–1202.
Medeiros, R. A., & Blanchette, D. (2011). MDESC: Stata module to tabulate prevalence of missing values. In Statistical Software Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s457318.html
Nguyen, M. C. (2008). XTMIS: Stata module to report missing observations for each variable in xt data. In Statistical Software Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s456945.html
Schpero, W. L. (2018). STATASTATES: Stata module to add US state identifiers to dataset. In Statistical Software Components. Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s458205.html
StataCorp. (2019). Stata User’s Guide Release 16. Stata Press.

Author information

Authors and Affiliations

Counseling, Special, and Higher Education, University of Maryland, College Park, MD, USA
Marvin Titus

5.6 Appendix

*Chapter 5 Syntax

*use time series data from Chap. 4.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 4\Stata"
use "Percent of US high school graduates in PSE, 1960 to 2016.dta"

*examine structure of the dataset
describe

*reduce the amount of memory required by float by compressing the data
compress

*compare after compressing, show structure
describe

*open panel dataset
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
use "Example 5.0.dta"

*compress the data and show structure
compress
describe

*recast the variable id as an integer
recast int id
describe

*save
save "Example 5.0.dta", replace
clear all

*using a large amount of data from secondary data sources such as the ///
National Center for Education Statistics’ (NCES) ///
public-use High School Longitudinal Study of 2009 (HSLS:09) student dataset

*set the maximum variables to 10,000
set maxvar 10000

*download all student data from the HSLS:09 dataset in Stata

*examine a shortened version of the HSLS:09 dataset’s structure
describe, short

*look at how much memory this dataset uses
memory

*try to see if we can compress the data
compress

*close dataset
clear all

*import an Excel file (SHEEO finance data)
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Excel files"
import excel ///
    "SHEEO_SHEF_FY18_Nominal_Data.xlsx", sheet("State and U.S. Nominal Data (2") firstrow

*Because we want to use only post-Great Recession data, we drop observations ///
if they are prior to fiscal year (FY) 2010.
drop if FY<2010

*We use the list command to take a quick look at the data, particularly ///
with respect to FY 2010. We make the command conditional by using if.
list if FY==2010

*Because we want only states in our dataset, we drop all observations for ///
the U.S. total and Washington DC.
drop if State=="US"
drop if State=="Washington DC"

*we employ the user-created statastates (Schpero 2018) program to create ///
fips codes and other state identifiers; include nogenerate option to ///
prevent the generation of the variable _merge
statastates, name(State) nogenerate

*To create a variable, stateid, based on state names, we use egen.
egen stateid = group(State)

*We use the compress command to save computer memory.
compress

*we use stateid and FY to declare the dataset to be a panel
xtset stateid FY, yearly

*data are saved to a file with a new name.
cd "C:\Users\Marvin\Dropbox\Manuscripts\Book\Chapter 5\Stata files"
save "Example 5.2.dta"
clear all

*Using selected variables from the public-use version of the HSLS:09 ///
that we saved in a Stata file (i.e., Example 5.3)
use "Example 5.3.dta"

*determine whether and how missing data are coded for the variable S3CLGPELL
codebook S3CLGPELL

*before conducting missing data analysis, we need to ///
change -9 to "." for all variables
mvdecode _all, mv(-9=.)

*Missing data analysis
*install Stata user-created program, mdesc (Medeiros and Blanchette 2011)
ssc install mdesc, replace

*produce a table with the number of missing values, total number of cases, ///
and percent missing for each variable in our file.
mdesc

*use the Stata command misstable tree, with various options, to show the ///
pattern of "missingness" in the data
misstable tree

*use Stata command, misstable patterns
misstable patterns

*use another option, frequency
misstable tree, frequency

*conduct missing data analysis employing Stata user-created ///
routine "xtmis" (Nguyen 2008). The program must be installed by typing:
ssc install tomata
ssc install xtmis

*use the Stata command "tostring" to create a string variable (unitid_s), ///
based on the numeric IPEDS variable (unitid). Then we invoke xtmis.
tostring unitid, generate(unitid_s)
xtmis grantlow, id(unitid_s)

*Stata user-created "missings" command; install most recent version
net install dm0085_1.pkg, replace

*examine missing data by SES quintiles
bysort X1SESQ5 : missings table

*same command can be repeated by racial-ethnic groups.
bysort X1RACE : missings table

*Missing Data - Missing Completely at Random
*install the Stata user-written program, mcartest (Li 2013)
net install st0318.pkg, replace

*set maximum variables to 10,000 and open a large dataset, the public-use ///
version of the HSLS:09
set maxvar 10000
use "HSLS09.dta"

*keep selected variables
keep STU_ID X1SEX X1RACE X1SES X1SESQ5 X4ATPRLVLA S3CLGPELL P1TUITION

*convert code (-9) for missing data to "."
mvdecode _all, mv(-9=.)

*test assumption of missing completely at random (MCAR) of two variables
mcartest S3CLGPELL P1TUITION

*add covariates to test the covariate-dependent missingness (CDM)
mcartest S3CLGPELL P1TUITION = i.X1RACE if X1RACE !=. , unequal emoutput nolog

*exit Stata
exit
*end

About this chapter

Cite this chapter

Titus, M. (2021). Getting to Know Thy Data. In: Higher Education Policy Analysis Using Quantitative Techniques . Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-60831-6_5

DOI: https://doi.org/10.1007/978-3-030-60831-6_5
Published: 15 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60830-9
Online ISBN: 978-3-030-60831-6
eBook Packages: Mathematics and Statistics (R0)
