Data Structures

Allerhand, Mike

doi:10.1007/978-3-642-17980-8_2

Part of the book series: SpringerBriefs in Statistics ((BRIEFSSTATIST))

Abstract

Programming with R objects and data structures.

Notes

1.
A vector is the simplest data structure in R. Scalar values are treated as single-cell vectors. A vector can be thought of either as a row or column since under matrix multiplication in R it is interpreted in whichever way makes it conformable with the other argument.
2.
R also provides a "complex" type for complex numbers, and a "raw" type for raw bytes. Cells with missing values may contain the special value "Not Available" ( NA ).
3.
Sometimes decimal numbers may be printed in exponential form, such as: 1.5e-08 . This notation is used to print very small or very large numbers. The number 1.5e-08 is "1.5 times 10 to the -8", in other words 1.5/100000000.
4.
A string is a sequence of one or more keyboard characters, including spaces and control characters such as \n (newline) and \t (tab), enclosed within quote marks (double or single quotes). See help(backquote) for the list of control characters and further information about quote marks. Each cell of a character vector contains a string. For example: "apple", 'apple', "orange" .
5.
Logical values are the type of data returned by conditional expressions such as 3 > 2 ( TRUE as 3 is greater than 2), and "apple" > "orange" ( FALSE as "a" is alphabetically less than "o" ). TRUE and FALSE can be abbreviated to T and F . Note: don't create variables with these names as they will mask the abbreviations.
6.
The rules for type coercion are as follows: logical => numeric => character . For example mixtures of numeric and character data are forced to character data, in which case all numbers become quoted strings, such as: "3.14" . The one exception to this is the special value NA (Not Available) used to signify a missing value. Functions with names that begin "is." are provided to test the type, and functions with names that begin "as." can be used to coerce the type. See apropos("^is\\.") and apropos("^as\\.") . as.numeric : Number strings => numbers. Non-number strings => NA . TRUE => 1 , FALSE => 0 . as.character : Numbers => number strings. TRUE => "TRUE" , FALSE => "FALSE" . as.logical : 0 => FALSE , all non-zero numbers => TRUE . Character strings => NA , except for strings that spell a logical value such as "TRUE" , "true" , "T" , "FALSE" , "false" , and "F" .
7.
See the seq function for making a sequence in fractional steps.
8.
An "attribute" is a named piece of additional information attached to a data structure. Some attributes are part of the language in the sense that R functions understand them, notably names (named elements or columns) and dim (dimensions of matrices and data frames). Functions called names and dim are provided to get and set these attributes. However an attribute can also be any information you like, such as a comment or note attached to a data structure. Function attr is provided to get or set any individual attribute, and function attributes to get or set an object's list of attributes. Function as.numeric has the side effect of removing attributes. Function c also removes attributes except for names .
9.
Most R packages provide some data sets as well as functions. Use function data() to see the data sets that are loaded by default. Data sets have help pages. For example the page describing the structure and variables of the data set named swiss is displayed by help(swiss) .
10.
The row names are just row numbers by default, but the attribute may be assigned a character vector of names using function rownames . Use names or colnames to get or set the column names.
11.
See also: month.name and month.abb for the full and abbreviated names of the months; date , Sys.Date , and Sys.time for date and time strings.
12.
See function: strsplit for splitting strings into vectors.
13.
String pattern matching uses regular expressions. See: help(regex) . See function substr for extracting and replacing characters at a given position within the string.
14.
A different sample is drawn each time a sampling function is called because the seed of the internal random number generator is updated automatically. If you want to draw the same sample you can set the seed. See help(set.seed) .
15.
Equivalent functions for other distribution families include: rt, rf, rbinom, rpois, rchisq, rexp, rweibull, rgeom, rhyper, rlogis, rbeta, rgamma .
16.
See: read.table , readLines , scan , and related functions.
17.
Functions are provided in the foreign library to read data in several proprietary formats, including SAS, Stata, and SPSS. It is also possible to read data from certain databases such as MySQL. See the R Data Import/Export link from the HTML documentation displayed by help.start() .
18.
If R displays a message that there is "No such file or directory" the most likely explanation is that your working directory is not pointing to the folder that contains the file. Set your working directory, or provide an absolute pathname to the file. Pathnames should either use forward slashes as in "C:/path/to/file" , or double backslashes as in "C:\\path\\to\\file" .
19.
See: help(connections) and help(clipboard) .
20.
Several variants of read.table are provided with slightly different defaults. See help(read.table) .
21.
If a column of numbers is intended to be a grouping indicator then it is necessary explicitly to convert the column to a factor, for example using function as.factor .
22.
Set argument stringsAsFactors=FALSE to override the default behaviour and keep columns of character data as character vectors. Use command options(stringsAsFactors = FALSE) if you wish to set this default behaviour globally. (From R 4.0.0 onward the default is already stringsAsFactors = FALSE .) If a column of numbers contains characters such as "." to signify a missing value, then the whole column is taken to be characters and, where that is the default, converted to a factor. Set argument na.strings to "." to override this and recode all "." as NA (see the section on Missing Values).
23.
Functions are designed where possible to allow different kinds of input data, and to implement the generic meaning of the function in whatever way makes best sense for the kind of data they are given. For example the description of the arguments in help(cor) suggests they can be a numeric vector, matrix or data frame.
24.
See: help(Arithmetic) , help(Comparison) , and help(Syntax) .
25.
A warning message about fractional recycling is displayed if the longer length is not an exact multiple of the shorter. This can safely be ignored, or turned off by the command: options(warn = -1) .
26.
Matrix multiplication x %*% y is conformable if ncol(x) == nrow(y) . If one argument is a vector it is interpreted as a row or column to suit so a transpose is unnecessary.
27.
Annoyance: when testing for negative numbers an expression like x<-1 will unexpectedly modify x because the <- is interpreted as the assignment operator. The workaround is to include a space: x < -1 .
28.
& and | are vectorized for combining logical index vectors. Corresponding operators && and || result in single truth values, usually for purposes of flow control in a program. See help(Logic, package="base") , and the examples therein which show how to construct truth tables defining the logical operations.
29.
To coerce a factor with numeric labels to a numeric vector, first coerce it to character using as.character and then to numeric using as.numeric .
30.
There are programming advantages to abstracting the grouping information in this way: it provides a device for manipulating groupings that does not depend upon a particular shape or layout for data, that can accommodate missing observations, and that enables programming easily to express conditions on the groupings, such as to extract and summarise the observations in a particular group.
31.
An ordered factor does not refer to the order of the level labels, but simply marks the factor as "ordered" so it can be handled appropriately by functions where the distinction between nominal and ordinal data is relevant. For example the contrast coding used in linear modelling functions has different defaults for unordered and ordered factors. Unordered factors get comparisons between group means. Ordered factors get trend analysis.
32.
The order is relevant to the default appearance of tables and graphs, and also in modelling functions when a reference level is used for comparisons. See also function relevel to re-order levels so a given level is the first (reference) level.
33.
See: help(Extract) , help("[.data.frame") , and help(subset) .
34.
See: help(names) , and also help(colnames) , help(dimnames) , and help(row.names) .
35.
Caveat: assignments to attached variables are not assignments to the data frame. Attaching a data frame works by inserting the object on the search path so the variables within the data frame can be found by name. See help(search) . However the global environment is always searched first, so variables in the data frame may be masked by variables in the global environment. Any assignment to variables of the same name assigns in the global environment and not in the data frame. So assignments to attached variables have no effect on the data frame.
36.
Columns may only be appended. A new column is not allowed to leave "holes" after existing columns.
37.
See: help("[.factor") .
38.
See also functions split and unsplit which split a vector or data frame into a list, or combine list components.
39.
See also functions melt and cast in package reshape .
40.
See package mvnmle for maximum likelihood imputation, and various packages for multiple imputation such as mitools and mice listed at http://cran.r-project.org/web/views/Multivariate.html.
41.
See: help(na.fail) and help(naresid) . Note that na.exclude works by passing a message, via the result's na.action attribute, to functions that subsequently process the result. To take advantage of na.exclude it is necessary to process the result with an appropriate function. For example fitted(fit) and residuals(fit) propagate NA values, but fit$fitted and fit$residuals do not.
42.
See: help(Control) .
43.
See also: combn to generate combinations of elements, choose for the number of combinations, and factorial for the size of a full permutation. See permutations in package gtools for enumerating a full permutation.
44.
See: apropos("apply") .
45.
See the examples in help(by) .
46.
See: help("function") .
47.
See also: help(return) .
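
Note 6 describes the coercion order logical => numeric => character. The following is a minimal sketch of those rules in R; the variable names are illustrative only and do not come from the chapter.

```r
# Mixing types in c() coerces every cell to the most general type present.
x <- c(TRUE, 2)         # logical is promoted to numeric
y <- c(1, "a", TRUE)    # everything is promoted to character
z <- c(1, NA)           # NA does not force a change of type

class(x)                # "numeric"
class(y)                # "character"
class(z)                # "numeric"

# Explicit coercion with the as.* functions.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
n <- suppressWarnings(as.numeric(c("3.14", "abc")))   # "abc" becomes NA
b <- as.logical(c(0, 1, -2))                          # FALSE TRUE TRUE
s <- as.character(c(TRUE, FALSE))                     # "TRUE" "FALSE"
```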
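
Note 8 says that as.numeric removes attributes and that c removes all attributes except names. A short sketch of that behaviour follows; the attribute name "note" and its text are made up for illustration.

```r
x <- c(a = 1, b = 2, c = 3)                # a named numeric vector
attr(x, "note") <- "illustrative comment"  # an arbitrary user attribute

names(x)         # the names attribute
attributes(x)    # a list holding both names and note

y <- as.numeric(x)   # drops all attributes, including names
z <- c(x)            # keeps names but drops the note

attributes(y)    # NULL
attributes(z)    # only names remain
```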
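
Note 14 mentions setting the seed to reproduce a sample. As a small sketch, with the seed value 42 chosen arbitrarily:

```r
set.seed(42)             # fix the state of the random number generator
a <- sample(1:10, 3)

set.seed(42)             # reset to the same state
b <- sample(1:10, 3)

identical(a, b)          # TRUE: the same sample is drawn again
```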
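
Note 22 describes how a "." used as a missing-value marker turns a numeric column into character data, and how na.strings avoids this. The sketch below reads a small made-up table from a string rather than a file; the column names id and score are assumptions for illustration.

```r
# A tiny data set in which "." marks a missing score.
txt <- "id score\n1 2.5\n2 .\n3 4.0"

# Without na.strings the "." forces the whole column to character.
raw <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# With na.strings = "." the marker is recoded as NA and the column stays numeric.
fixed <- read.table(text = txt, header = TRUE, na.strings = ".")

class(raw$score)      # "character"
class(fixed$score)    # "numeric"
```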
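
Note 25 concerns recycling of the shorter vector in arithmetic. A minimal illustration, using made-up values:

```r
x <- 1:6

# The shorter vector (length 2) is recycled three times: no warning.
r <- x + c(10, 20)
r                        # 11 22 13 24 15 26

# Here 6 is not a multiple of 4, so R warns about the lengths
# but still returns a result of length 6.
r2 <- suppressWarnings(x + 1:4)
length(r2)
```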
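
Note 26 states that a vector argument to %*% is treated as a row or a column, whichever makes the product conformable. A brief sketch with an arbitrary 2 x 3 matrix:

```r
m <- matrix(1:6, nrow = 2)   # 2 x 3 matrix, filled column by column

v <- 1:3
col_result <- m %*% v        # v is treated as a 3 x 1 column
dim(col_result)              # 2 1

w <- 1:2
row_result <- w %*% m        # w is treated as a 1 x 2 row
dim(row_result)              # 1 3
```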
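
The pitfall in Note 27 is easy to reproduce. In this sketch the vector values are arbitrary:

```r
x <- c(-3, 0, 2)
x<-1                     # parsed as assignment: x is overwritten with 1
overwritten <- x

x <- c(-3, 0, 2)         # restore the data
is_below <- x < -1       # with a space this is a comparison
also_ok  <- x < (-1)     # parentheses disambiguate as well

is_below                 # TRUE FALSE FALSE
```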
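
Note 28 distinguishes the vectorized operators & and | from the single-value operators && and || . A small sketch, with invented scores, might look like this:

```r
scores <- c(4, 9, 12)

# & works cell by cell and gives a logical index vector.
keep <- scores > 3 & scores < 10
selected <- scores[keep]     # 4 9

# && gives one truth value, suited to flow control.
# Each operand here is a single value.
if (length(scores) > 0 && scores[1] > 3) {
  status <- "first score exceeds 3"
} else {
  status <- "first score is 3 or less"
}
```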
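
Note 29 recommends going via as.character when converting a factor with numeric labels. The sketch below, using made-up labels, shows why the direct conversion is misleading:

```r
f <- factor(c("10", "5", "10", "20"))

# Direct conversion returns the internal level codes, not the labels.
# The levels are sorted as strings: "10", "20", "5".
wrong <- as.numeric(f)                  # 1 3 1 2

# Converting to character first recovers the labels as numbers.
right <- as.numeric(as.character(f))    # 10 5 10 20
```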
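
Note 41 explains that na.exclude only takes effect through extractor functions such as fitted and residuals . A minimal sketch with a made-up data frame (the variable names x and y and the values are illustrative):

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, NA, 6.2, 7.9, 10.1))   # one missing response

fit <- lm(y ~ x, data = d, na.action = na.exclude)

# The extractor functions pad the result with NA for the excluded row.
length(residuals(fit))   # 5
length(fitted(fit))      # 5

# The raw components hold only the rows used in the fit.
length(fit$residuals)    # 4
```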

Author information

Authors and Affiliations

Department of Psychology, Centre for Cognitive Ageing and Cognitive Epidemiology, George Square 7, Edinburgh, EH8 9JZ, UK
Mike Allerhand

Authors

Mike Allerhand

Corresponding author

Correspondence to Mike Allerhand .

About this chapter

Cite this chapter

Allerhand, M. (2011). Data Structures. In: A Tiny Handbook of R. SpringerBriefs in Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17980-8_2

DOI: https://doi.org/10.1007/978-3-642-17980-8_2
Published: 25 March 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17979-2
Online ISBN: 978-3-642-17980-8
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)