Vector
Overview
Atomic Vectors
- logical
- numeric: integer, double
- character
Scalar
lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c("these are", "some strings")
typeof(lgl_var)
#> [1] "logical"
typeof(int_var)
#> [1] "integer"
typeof(dbl_var)
#> [1] "double"
typeof(chr_var)
#> [1] "character"
Longer Vector
c() always flattens nested atomic vectors:
c(c(1, 2), c(3, 4))
#> [1] 1 2 3 4
Missing Value
NA propagates through most operations, except when the result is the same for every possible value:
NA ^ 0
#> [1] 1
NA | TRUE
#> [1] TRUE
NA & FALSE
#> [1] FALSE
Checking NA
Don’t use == to check for NA, because the comparison itself returns NA:
x <- c(NA, 5)
x == NA
#> [1] NA NA
Use is.na() instead:
is.na(x)
#> [1] TRUE FALSE
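A short sketch of the same check in a slightly larger workflow. It reuses the vector x from above; anyNA() and which() are base R functions that the notes do not mention, added here only for illustration.

```r
x <- c(NA, 5)

# Comparing with NA is itself unknown, so every element comes back NA.
cmp <- x == NA
all(is.na(cmp))
#> [1] TRUE

# anyNA() reports whether at least one element is missing.
anyNA(x)
#> [1] TRUE

# which() turns the logical mask from is.na() into positions.
which(is.na(x))
#> [1] 1

# Drop the missing values before summarising.
mean(x[!is.na(x)])
#> [1] 5
```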
Types of NA
typeof(NA_integer_)
#> [1] "integer"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"
This distinction is usually unimportant because NA will be automatically coerced to the correct type when needed.
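The automatic coercion can be seen by combining a plain NA with values of each type. This is a minimal sketch using only base R:

```r
# A bare NA is logical, the lowest type in the coercion hierarchy.
typeof(NA)
#> [1] "logical"

# Combined with other values, NA takes on their type.
typeof(c(NA, 1L))
#> [1] "integer"
typeof(c(NA, 1.5))
#> [1] "double"
typeof(c(NA, "a"))
#> [1] "character"
```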
Testing
Avoid is.vector(), is.atomic(), and is.numeric(): they don’t test if you have a vector, atomic vector, or numeric vector; you’ll need to carefully read the documentation to figure out what they actually do.
Use the type-specific tests instead:
is.logical(TRUE)
#> [1] TRUE
is.integer(1L)
#> [1] TRUE
is.double(2)
#> [1] TRUE
is.character("Hi")
#> [1] TRUE
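Two of the surprises behind that warning, as a small sketch. The attribute name note is hypothetical and exists only for this illustration.

```r
# is.vector() is FALSE as soon as an object carries any attribute
# other than names, even though the data is still an integer vector.
x <- structure(1:3, note = "hypothetical attribute")
is.vector(x)
#> [1] FALSE
is.integer(x)
#> [1] TRUE

# is.numeric() is FALSE for factors, although they are stored as integers.
f <- factor(c("a", "b"))
typeof(f)
#> [1] "integer"
is.numeric(f)
#> [1] FALSE
```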
Coercion
When you combine atomic vectors of different types, they are coerced in a fixed order: character → double → integer → logical.
str(c(F, 1))
#> num [1:2] 0 1
str(c(1, "a"))
#> chr [1:2] "1" "a"
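Each step of that coercion order can be checked directly with typeof(). A minimal sketch in base R:

```r
# logical + integer -> integer
typeof(c(TRUE, 1L))
#> [1] "integer"

# integer + double -> double
typeof(c(1L, 1.5))
#> [1] "double"

# double + character -> character
typeof(c(1.5, "a"))
#> [1] "character"

# Mixing everything ends at the most flexible type, character.
mixed <- c(TRUE, 1L, 1.5, "a")
mixed
#> [1] "TRUE" "1"    "1.5"  "a"
```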
Coercing logical to numeric can be useful, because TRUE becomes 1 and FALSE becomes 0:
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
#> [1] 0 0 1
# Total number of TRUEs
sum(x)
#> [1] 1
# Proportion that are TRUE
mean(x)
#> [1] 0.3333333
Deliberate coercion of values that cannot be converted gives a warning and a missing value:
as.integer(c("1", "1.5", "a"))
#> Warning: NAs introduced by coercion
#> [1] 1 1 NA
Attributes
Set and Get Attributes
Set & Get specific attributes: attr()
Set all attributes: structure()
Get all attributes: attributes()
a <- 1:3
attr(a, "x") <- "abcdef"
attr(a, "x")
#> [1] "abcdef"
attr(a, "y") <- 4:6
str(attributes(a))
#> List of 2
#> $ x: chr "abcdef"
#> $ y: int [1:3] 4 5 6
# Or equivalently
a <- structure(
  1:3,
  x = "abcdef",
  y = 4:6
)
str(attributes(a))
#> List of 2
#> $ x: chr "abcdef"
#> $ y: int [1:3] 4 5 6
Attributes should generally be thought of as ephemeral.
attributes(a[1])
#> NULL
attributes(sum(a))
#> NULL
There are only two attributes that are routinely preserved:
- names
- dim
To preserve other attributes, create your own S3 class.
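A small sketch of the difference between a routinely preserved attribute and an ephemeral one. The attribute note is hypothetical and used only for illustration.

```r
# "note" is a made-up custom attribute; names is a standard one.
a <- structure(1:3, names = c("x", "y", "z"), note = "hypothetical")

# Subsetting keeps the names of the selected elements...
b <- a[1:2]
names(b)
#> [1] "x" "y"

# ...but silently drops the custom attribute.
attr(b, "note")
#> NULL
```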
Names
You can name a vector in three ways:
# When creating it:
x <- c(a = 1, b = 2, c = 3)
# By assigning a character vector to names()
x <- 1:3
names(x) <- c("a", "b", "c")
# Inline, with setNames():
x <- setNames(1:3, c("a", "b", "c"))
Remove names from a vector by using x <- unname(x) or names(x) <- NULL.
y <- c(a = 1, 2)
names(y)
#> [1] "a" ""
y <- unname(y)
names(y)
#> NULL
Missing names may be either "" or NA_character_. If all names are missing, names() will return NULL.
Dimensions
Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.
matrix()
# Two scalar arguments specify row and column sizes
x <- matrix(1:6, nrow = 2, ncol = 3)
x
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
array()
# One vector argument to describe all dimensions
y <- array(1:12, c(2, 3, 2))
y
#> , , 1
#>
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
#>
#> , , 2
#>
#>      [,1] [,2] [,3]
#> [1,]    7    9   11
#> [2,]    8   10   12
dim()
# You can also modify an object in place by setting dim()
z <- 1:6
dim(z) <- c(3, 2)
z
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5
#> [3,]    3    6
Many of the functions for working with vectors have generalisations for matrices and arrays:
Vector | Matrix | Array |
---|---|---|
names() | rownames(), colnames() | dimnames() |
length() | nrow(), ncol() | dim() |
c() | rbind(), cbind() | abind::abind() |
- | t() | aperm() |
is.null(dim(x)) | is.matrix() | is.array() |
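A vector without a dim attribute is not the same thing as a one-dimensional array or a single-row or single-column matrix, even though they print similarly: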
str(1:3) # 1d vector
#> int [1:3] 1 2 3
str(matrix(1:3, ncol = 1)) # column vector
#> int [1:3, 1] 1 2 3
str(matrix(1:3, nrow = 1)) # row vector
#> int [1, 1:3] 1 2 3
str(array(1:3, 3)) # "array" vector
#> int [1:3(1d)] 1 2 3
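The difference shows up in the dim attribute and in the is.*() tests from the table above. A minimal sketch; the variable names v, col, and arr are chosen here for illustration.

```r
v   <- 1:3                      # plain vector, no dim attribute
col <- matrix(1:3, ncol = 1)    # 3 x 1 matrix
arr <- array(1:3, 3)            # 1d array

dim(v)
#> NULL
dim(col)
#> [1] 3 1
dim(arr)
#> [1] 3

# Only an object with a dim attribute of length 2 counts as a matrix.
is.matrix(col)
#> [1] TRUE
is.matrix(arr)
#> [1] FALSE
is.array(arr)
#> [1] TRUE
```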
S3 Atomic Vectors
Factors
A factor is built on top of an integer vector and can contain only predefined values.
It has two attributes:
- class: “factor”
- levels: defines the set of allowed values.
x <- factor(c("a", "b", "b", "a"))
x
#> [1] a b b a
#> Levels: a b
typeof(x)
#> [1] "integer"
attributes(x)
#> $levels
#> [1] "a" "b"
#>
#> $class
#> [1] "factor"
When you tabulate a factor you’ll get counts of all categories, even unobserved ones:
# Character
sex_char <- c("m", "m", "m")
table(sex_char)
#> sex_char
#> m
#> 3
# Factor
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_factor)
#> sex_factor
#> m f
#> 3 0
Ordered factors
They behave like regular factors, but the order of the levels is meaningful (some modelling and visualisation functions make use of it):
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade
#> [1] b b a c
#> Levels: c < b < a
Factors look like character vectors but are stored as integers, so it is best to convert them explicitly to character vectors if you need string-like behaviour.
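One place where the explicit conversion matters is a factor whose labels look like numbers. A minimal sketch in base R; the factor f is made up for illustration.

```r
f <- factor(c("10", "20", "10"))

# Converting directly exposes the underlying integer codes, not the labels.
as.integer(f)
#> [1] 1 2 1

# Going through character first recovers the values that were intended.
as.character(f)
#> [1] "10" "20" "10"
as.integer(as.character(f))
#> [1] 10 20 10
```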
Date
Date vectors are built on top of double vectors. They have class “Date” and no other attributes:
today <- Sys.Date()
typeof(today)
#> [1] "double"
attributes(today)
#> $class
#> [1] "Date"
The value of the double represents the number of days since 1970-01-01.
date <- as.Date("1970-02-01")
unclass(date)
#> [1] 31
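Because a Date is just a day count, ordinary arithmetic works on it. A small sketch with fixed dates chosen for illustration:

```r
d <- as.Date("1970-01-10")

# Nine days after the origin, 1970-01-01.
as.numeric(d)
#> [1] 9

# Adding a number moves the date by that many days.
d + 7
#> [1] "1970-01-17"

# Subtracting two dates gives a difftime measured in days.
gap <- as.Date("1970-02-01") - as.Date("1970-01-01")
gap
#> Time difference of 31 days
```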
Date-times
Date-times (POSIXct) are also built on top of double vectors; the value represents the number of seconds since 1970-01-01.
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
now_ct
#> [1] "2018-08-01 22:00:00 UTC"
typeof(now_ct)
#> [1] "double"
attributes(now_ct)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] "UTC"
unclass(now_ct)
#> [1] 1533160800
#> attr(,"tzone")
#> [1] "UTC"
The tzone attribute controls only how the date-time is formatted; it does not change the instant of time that the value represents.
Sys.timezone() # My time zone
#> [1] "Asia/Bangkok"
structure(now_ct, tzone = "America/New_York")
#> [1] "2018-08-01 18:00:00 EDT"
structure(now_ct, tzone = "Asia/Bangkok")
#> [1] "2018-08-02 05:00:00 +07"
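A quick way to confirm this is to compare the underlying numbers. The sketch below assumes the "America/New_York" time zone is available in the system's time zone database; the name ny is chosen here for illustration.

```r
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
ny <- structure(now_ct, tzone = "America/New_York")

# Same number of seconds since 1970-01-01, so the same instant.
as.numeric(now_ct)
#> [1] 1533160800
as.numeric(ny)
#> [1] 1533160800
now_ct == ny
#> [1] TRUE

# Only the printed clock time differs.
format(now_ct, "%H:%M")
format(ny, "%H:%M")
```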
Difftimes
Difftimes are built on top of doubles and have a units attribute that determines how the value should be interpreted:
one_week_1 <- as.difftime(1, units = "weeks")
one_week_1
#> Time difference of 1 weeks
typeof(one_week_1)
#> [1] "double"
attributes(one_week_1)
#> $class
#> [1] "difftime"
#>
#> $units
#> [1] "weeks"
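The same duration can be read in different units, which shows that the stored number only has meaning together with its units attribute. A minimal sketch; the name one_week is chosen here for illustration.

```r
one_week <- as.difftime(1, units = "weeks")

# The stored double, interpreted in the current units (weeks).
as.numeric(one_week)
#> [1] 1

# The same duration expressed in days.
as.numeric(one_week, units = "days")
#> [1] 7

# Changing the units rescales the stored value accordingly.
units(one_week) <- "days"
one_week
#> Time difference of 7 days
```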
Lists
Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors.
Create List
l1 <- list(
  1:3,
  "a",
  c(TRUE, FALSE, TRUE),
  c(2.3, 5.9)
)
typeof(l1)
#> [1] "list"
str(l1)
#> List of 4
#> $ : int [1:3] 1 2 3
#> $ : chr "a"
#> $ : logi [1:3] TRUE FALSE TRUE
#> $ : num [1:2] 2.3 5.9
Because the elements of a list are references, the total size of a list might be smaller than you expect.
lobstr::obj_size(mtcars)
#> Error in loadNamespace(x): there is no package called 'lobstr'
l2 <- list(mtcars, mtcars, mtcars, mtcars)
lobstr::obj_size(l2)
#> Error in loadNamespace(x): there is no package called 'lobstr'
When given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them into a single list:
l5 <- c(
  list(1, 2),
  c(3, 4)
)
str(l5)
#> List of 4
#> $ : num 1
#> $ : num 2
#> $ : num 3
#> $ : num 4
As List
list(1:2)
#> [[1]]
#> [1] 1 2
as.list(1:2)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
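The difference is easiest to see from the lengths: list() wraps its argument as a single element, while as.list() makes one element per value. A minimal sketch; the names wrapped and split_up are chosen here for illustration.

```r
wrapped  <- list(1:2)     # one element holding the whole vector
split_up <- as.list(1:2)  # one element per value

length(wrapped)
#> [1] 1
length(split_up)
#> [1] 2

# unlist() turns either one back into the original atomic vector.
unlist(wrapped)
#> [1] 1 2
unlist(split_up)
#> [1] 1 2
```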
List Matrix
With lists, the dimension attribute can be used to create list-matrices or list-arrays:
l <- list(1:3, "a", TRUE, 1.0)
dim(l) <- c(2, 2)
l
#>      [,1]      [,2]
#> [1,] integer,3 TRUE
#> [2,] "a"       1
Data Frame
data.frame
A data frame is a named list of vectors of the same length, with attributes for the (column) names, row.names, and its class, "data.frame":
df1 <- data.frame(x = 1:3, y = letters[1:3])
typeof(df1)
#> [1] "list"
attributes(df1)
#> $names
#> [1] "x" "y"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3
- A data frame has rownames() and colnames(). The names() of a data frame are the column names.
- A data frame has nrow() rows and ncol() columns. The length() of a data frame gives the number of columns.
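These relationships can be checked on the small data frame from above. A minimal sketch in base R:

```r
df1 <- data.frame(x = 1:3, y = letters[1:3])

# Rows and columns.
dim(df1)
#> [1] 3 2

# length() counts columns, because a data frame is a list of columns.
length(df1)
#> [1] 2

# names() and colnames() agree.
identical(names(df1), colnames(df1))
#> [1] TRUE

# Default row names are the row numbers, returned as character.
rownames(df1)
#> [1] "1" "2" "3"
```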
tibble
Tibbles are lazy and surly: they do less and complain more.
library(tibble)
df2 <- tibble(x = 1:3, y = letters[1:3])
typeof(df2)
#> [1] "list"
attributes(df2)
#> $class
#> [1] "tbl_df" "tbl" "data.frame"
#>
#> $row.names
#> [1] 1 2 3
#>
#> $names
#> [1] "x" "y"
Tibble vs Data Frame
- Tibbles never coerce their input (since R 4.0.0, data.frame() also no longer converts strings to factors by default).
- Tibbles do not transform non-syntactic names.
- Tibbles will only recycle vectors of length one.
- Tibbles allow you to refer to variables created during construction.
df2 <- tibble(
  x = 1:3,
  y = c("a", "b", "c")
)
str(df2)
#> tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#> $ x: int [1:3] 1 2 3
#> $ y: chr [1:3] "a" "b" "c"
names(data.frame(`1` = 1))
#> [1] "X1"
names(tibble(`1` = 1))
#> [1] "1"
# Useful feature
tibble(
  x = 1:3,
  y = x * 2
)
#> # A tibble: 3 × 2
#> x y
#> <int> <dbl>
#> 1 1 2
#> 2 2 4
#> 3 3 6
Rowname
Row names are undesirable. Convert row names to a regular column with either:
- rownames_to_column()
- as_tibble() with the rownames argument
df3 <- data.frame(
  age = c(35, 27, 18),
  hair = c("blond", "brown", "black"),
  row.names = c("Bob", "Susan", "Sam")
)
as_tibble(df3, rownames = "name")
#> # A tibble: 3 × 3
#> name age hair
#> <chr> <dbl> <chr>
#> 1 Bob 35 blond
#> 2 Susan 27 brown
#> 3 Sam 18 black
Subsetting Caveat
data.frame allows partial matching when subsetting with $, which can be a source of bugs.
df3$a
#> [1] 35 27 18
# If not found
df3$x
#> NULL
If you want a single column, prefer df[["col"]], which matches the name exactly.
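The contrast between the two operators can be sketched on a base R data frame with the same columns as df3 above (row names are left out here because they do not affect the result).

```r
df3 <- data.frame(
  age = c(35, 27, 18),
  hair = c("blond", "brown", "black")
)

# $ partially matches "a" to the column "age".
df3$a
#> [1] 35 27 18

# [[ requires an exact name, so the abbreviated name finds nothing.
df3[["a"]]
#> NULL

# The full name works as expected.
df3[["age"]]
#> [1] 35 27 18
```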
List column in Tibble
tibble(
  x = 1:3,
  y = list(1:2, 1:3, 1:4)
)
#> # A tibble: 3 × 2
#> x y
#> <int> <list>
#> 1 1 <int [2]>
#> 2 2 <int [3]>
#> 3 3 <int [4]>
NULL
typeof(NULL)
#> [1] "NULL"
length(NULL)
#> [1] 0
x <- NULL
attr(x, "y") <- 1
#> Error in attr(x, "y") <- 1: attempt to set an attribute on NULL
There are two common uses of NULL:
- To represent an empty vector (a vector of length zero) of arbitrary type.
- To represent an absent vector. For example, NULL is often used as a default function argument, when the argument is optional but the default value requires some computation. (Contrast this with NA, which is used to indicate that an element of a vector is absent.)
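Both uses can be sketched in a few lines. The function label_values() and its argument labels are hypothetical, written only to illustrate the default-argument pattern.

```r
# Use 1: NULL as an empty vector. Combining it with anything adds nothing.
identical(c(), NULL)
#> [1] TRUE
identical(c(NULL, 1:2), 1:2)
#> [1] TRUE

# Use 2: NULL as the default for an optional argument whose real
# default has to be computed from the other arguments.
label_values <- function(x, labels = NULL) {
  if (is.null(labels)) {
    # No labels supplied, so build them from the length of x.
    labels <- paste0("v", seq_along(x))
  }
  setNames(x, labels)
}

names(label_values(c(10, 20)))
#> [1] "v1" "v2"
names(label_values(c(10, 20), labels = c("a", "b")))
#> [1] "a" "b"
```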