Chapter 4 Vectors
In this chapter we will learn about one of the most basic objects in R, socalled vectors. Before starting, let us do a quick detour to “objects”.
4.1 Objects in R
R is an objectoriented programming language with the fundamental design principle: Everything in R is an object. In R, objects can be:
 Variables (e.g.,
a
,b
,result
).  Functions (e.g.,
mean()
,max()
,sin()
).  Connection handlers.
 …
Remember the very simple example from the chapter First Steps where we calculated \(a^2\) as follows?
5
a <^2 a
We assign 5
to a
. This automatically creates a new object (a
) and stores
the desired value in it (here 5
).
Object orientation: All objects in R have a specific class that can
have a set of classspecific attributes as well as methods for certain
generic functions. In this simple example, a
gets a numeric vector of
length 1 (i.e., containing one numeric value).
Know the basics!
Vectors are sometimes also referred to as the ‘nuts & bolts’ in R as they build the basis for all complex objects such as data frames or fitted regression models. The image below shows a (simplified) schematic overview of how various more complex objects (on the right hand side) are based on vectors (on the left hand side).
Frequentlyused data types can also be distinguished by their dimensionality (1d, 2d, nd) and whether they are homogenous (with all elements of the same type) vs. heterogenous (with elements of potentially different types).
Dimension  Homogenous  Heterogenous 

1  Atomic vectors  Lists 
2  Matrix  Data frame 
\(\ge 1\)  Array 
In this course, we will learn about vectors (atomic vectors), matrices (2d arrays), factors, date and date time objects, lists, and data frames. In other situations, you will most certainly come across many other more complex objects, but keep in mind that most of them are created by using basic vectors (i.e., lists and atomic vectors).
Thus: Knowing these basics is essential to becoming a good programmer!
Print objects
When working on the interactive R console we can always print objects to see what is stored in them. This can be done by simply entering the name of the object and pressing enter:
a
## [1] 5
What happens is ‘implicit printing’ (we have not explicitly asked for it).
Explicit printing would be if we call the print()
function:
print(a)
## [1] 5
Both have the same effect when working on the interactive console. When writing
scripts, the print()
function becomes more important as implicit printing will
no longer work (or only in special situations).
One way to (i) create new objects and (ii) directly print them is by adding another pair of round brackets around the command. Instead of doing it in two steps
5^2
a < a
## [1] 25
we can call (a < 5^2)
. The additional brackets cause implicit printing
because the result of the assignment to a
is evaluated, thus creating
the same effect as the two separate lines above.
5^2) (a <
## [1] 25
This is sometimes useful for brevity, avoiding additional typing and saving some space in the displayed code.
4.2 (Atomic) vectors
A vector is nothing else than a sequence of elements of a certain type. R distinguishes vectors with two different modes.
 Atomic vectors: All elements must have the same basic type (e.g., numeric, character, …).
 Lists: Special vector mode. Different elements can have different types.
Lists are deferred to a later chapter and are not important for now. Also, we follow standard albeit somewhat sloppy jargon and refer to atomic vectors simply as vectors while referring to vectors of mode list explicitly as lists (rather than vectors).
(Atomic) vectors are the most basic objects in R as they can contain only data of one type (e.g., only numeric values, or only character strings, etc.). Six different types of data can be stored in atomic vectors.
Type  Example  Comment  

1  double (or numeric)  0.5 , 120.9 , 5.0 
Floating point numbers with double precision 
2  integer  1L , 121L , 5L 
“Long” integers 
3  logical  TRUE , FALSE 
Boolean 
4  character  "R" , "5" or 'R' , '5' 
Text 
5  complex  5+11i , 3+2i , 0+4i 
Real+imaginary numbers 
6  raw  01 , ff 
Raw bytes (as hexadecimal) 
Important vector functions
In programming, functions are used to perform a specific task, e.g., manipulate an object, calculate a derived quantity, or investigate existing objects. A few of the most important ones for creating and investigating simple vectors are:
c()
: Combines multiple elements into one atomic vector.length()
: Returns the length (number of elements) of an object.class()
: Returns the class of an object.typeof()
: Returns the type of an object. There is a small (sometimes important) difference betweentypeof()
andclass()
as we will see later.attributes()
: Returns further metadata of arbitrary type.
First example: Let us start with a simple example and assign (<
) the
value 1.5
(a numeric floating point number) to an object named x
.
1.5 x <
We can now use the functions above to investigate this vector and check what the functions return.
length(x) # Length of the vector.
## [1] 1
class(x) # What is the class of the object?
## [1] "numeric"
typeof(x) # What is the type of this vector?
## [1] "double"
 Length:
1.5
is a single numeric value. Thus,x
is a vector with only one element and has length1
. In R, a single element is always a vector of length 1, there is no special object for single values.  Class: The object is of class “numeric”.
 Type: While the class is “numeric” the underlying type is “double”. We will come back to what that actually means later.
Second example: We create a new atomic vector with two elements, both
integers (as indicated by the L
suffix for long integers). c()
is the function which combines both values to a
new vector, which we (again) store in x
. Whatever has been in x
previously
will be lost, as we overwrite the object.
c(111L, 555L) # Create the vector; overwrite 'x'
x <length(x) # Checking length
## [1] 2
class(x) # Checking class
## [1] "integer"
typeof(x) # Checking the type
## [1] "integer"
 Length: As expected,
length()
now returns2
as we have stored two elements.  Class and type: Both functions return
"integer"
.
Third example: As a last example let us create a character vector. A character (or character string)
is nothing else but “text”, declared in R either by double quote "..."
or single quotes '...'
.
In this case we store two character strings with short text snippets in a new vector called y
:
c("this is", "just some text")
y <length(y)
## [1] 2
class(y)
## [1] "character"
typeof(y)
## [1] "character"
 Length: The length is 2 as our atomic vector
y
contains two character strings ("this is"
and"just some text"
). Note thatlength()
does not count the number of individual characters, but the number of elements in our vectory
. Feel free to trynchar(y)
and note the difference.  Class and type: Twice the same, class and type are identical.
Double (or numeric)
One atomic data class is the double class (double precision floating
point values). Double is actually not often used in R, objects
of type double are simply called numeric and the class of
floating point numbers will be "numeric"
.
1.5
x < x
## [1] 1.5
Note: the number 10
will also be numeric (double) as 10
is interpreted
as 10.000
.
10) (x <
## [1] 10
class(x)
## [1] "numeric"
Integers
A second atomic class is the integer class. Integers can explicitly be defined
using the L
(“Long”) suffix. While 10
will be double (or numeric;
10.0
), 10L
defines an integer (natural number).
10L) (x <
## [1] 10
class(x)
## [1] "integer"
This is not important most of the time when you are using R. Except when it is! There are some situations where integers are required!
Logical
Logicals are rather simple as they can only take
up two different values: either TRUE
or FALSE
.
TRUE) (x <
## [1] TRUE
FALSE) (x <
## [1] FALSE
class(x)
## [1] "logical"
Note: TRUE
and FALSE
must be written without quotation marks (they are logicals not characters).
Characters
The last atomic class we will deal with is the character class. A character is basically just a text and can be of any length.
"Austria") (x <
## [1] "Austria"
"Max Konstantin Mustermann (AT)") (x <
## [1] "Max Konstantin Mustermann (AT)"
class(x)
## [1] "character"
Checking class/type
As shown, class()
and typeof()
can be used to determine the class and type of an object.
Both functions return a character vector which contains the name of the
class/type (e.g., "numeric"
or "integer"
).
Alternatively, a range of functions exist to check whether or not an object is
of a specific type. These very handy functions return either a logical TRUE
if the input is of this specific type, or FALSE
if not. Examples are:
is.double()
is.numeric()
is.integer()
is.logical()
is.character()
is.vector()
 … and many more.
Some examples:
is.integer(1.5)
## [1] FALSE
is.double(1.5)
## [1] TRUE
is.character("Bruno")
## [1] TRUE
is.vector(c(1, 2, 3, 4))
## [1] TRUE
These functions return to us one single logical value TRUE
if the object
is of a specific class/type, and a FALSE
if not.
Double vs. numeric
Let’s make things a bit more complicated. If you are new to programming this might be a bit confusing. However, at some point it will get important to know that there is a small but important difference between “numeric” and “double”.
For those who already have a background in programming I would like to clarify
the language. If you are coming from another programming language you might be
familiar with objects of type “double”, “real”, “float”, and “integer”. A “double” is a double
precision floating point numeric value (e.g., 0.001234
). Depending on the programming
language they might be referred to as “double”, “real”, or “float” which are basically the same.
An “integer” is also a numeric value but from \(\pm \mathcal{N}_0\)
([..., 3, 2, 1, 0, 1, 2, 3, ...]
; positive and negative natural numbers including zero).
As we have learned, R is based on C and, thus, shares the data types
“double” and “integer” with C. However, there is a difference between the
class of an object in R and the type (which is the underlying type).
Let us check what class()
and typeof()
returns for 1
, 1.0
, and 1L
:
x  class(x)  typeof(x)  is.double(x)  is.integer(x)  is.numeric(x) 

1 
numeric  double  TRUE  FALSE  TRUE 
1.0 
numeric  double  TRUE  FALSE  TRUE 
1L 
integer  integer  FALSE  TRUE  TRUE 
The table shows (left to right) the value which is checked (x
),
the R class of the object x
, the type of x
, and three checks: whether the
value x
is double, numeric, or integer (is.*()
).
1
and1.0
: are of class “numeric”. The underlying data type in C is “double”, thus bothis.double()
andis.numeric()
returnTRUE
.1L
: this is how an integer is specified in R.1L
is of class “integer” and the underlying C type is also “integer”, however, an integer in R is also threated as “numeric”! Thusis.numeric()
returnsTRUE
as well.
Why? Well, the philosophy is that one can use both for mathematical operations.
In R there is no difference between 5L^2L
, 5L^2.0
, and 5.0^2L
as in some other
programming languages. Thus, is.numeric()
basically tells us whether or not we can use this
specific object (x
) for mathematical calculations.
Keep this in mind, we may get back to it here and there, but we will not use the small but sometimes important difference between “double” and “numeric” often.
Missing values
One additional important element is the missing values. R knows two types of
missing values: NaN
and NA
. NaN
is “not a number” and results from
mathematical operations which are illegal/invalid.
NA
on the other hand stands for ‘not available’ and indicates a missing
value. NA
s can occur in all the vectors discussed above, and keep their class.
NaN
: Mathematically not defined (and always of classnumeric
).NA
: Missing value,NA
’s still have classes!NaN
is alsoNA
but not vice versa.
Example for NaN
: Zero devided by zero (0 / 0
) is not a valid
mathematical operation, wherefore the result of the devision is NaN
.
0 / 0) (x <
## [1] NaN
Let us check the class and see if the object x
is ‘not a valid number’
(is.nan()
) and/or a missing value (is.na()
).
is.nan(x) # Check if 'x' is NaN
## [1] TRUE
is.na(x) # Check if 'x' is NA
## [1] TRUE
Note: Missing values in R still have a class. We can have missing numeric, integer, logical, or character missing values. They all look the same, but have a class attached to it.
class(x) # The value from above
## [1] "numeric"
When defining a new object and storing a single missing value on it (with x < NA
),
it will be ‘logical’ by default. We can, however, convert them (if needed).
# By default  logical
NA) (x <
## [1] NA
class(x)
## [1] "logical"
# Convert to integer
as.integer(NA)) (y <
## [1] NA
class(y)
## [1] "integer"
# Convert to character
as.character(NA)) (z <
## [1] NA
class(z)
## [1] "character"
As show, all three objects (x
, y
, z
) look identical and contain one
missing value. However, the class of the three objects differs. More about
missing values in the corresponding sections where they are getting important.
4.3 Creating vectors
Concatenating values c()
As quickly shown earlier, c()
can be used to combine (one or) multiple elements
to a new vector. An example:
c(1.5, 1000, 0.1)) (x <
## [1] 1.5 1000.0 0.1
This creates a numeric vector of length 3 with the three elements 1.5
,
1000
, and 0.1
(in this order).
Important: As mentioned at the very beginning of this chapter, vectors can only contain data of one type! We cannot easily mix e.g., a character string with numeric values or vice versa. It is possible, but we need to take care of what happens in this case. We will come back to that later (subsection Coercion (mixing objects)).
In the same way we can create vectors of other types, e.g., a character vector with three elements:
c("Austria", "Tyrol", "Innsbruck")) (y <
## [1] "Austria" "Tyrol" "Innsbruck"
As each element (e.g., "Austria"
) itself can be seen as a character vector of length one,
c()
can also be seen (and used) to combine vectors just the same way as used for elements.
c("Austria", "Tyrol", "Innsbruck")
x < c("Germany", "Berlin", "Berlin")
y < c(x, y)
z < z
## [1] "Austria" "Tyrol" "Innsbruck" "Germany" "Berlin" "Berlin"
length(z)
## [1] 6
We could, of course, repeat this example for all atomic vectors including doubles (numeric), logical, character, integer, and complex values (try it yourself!):
c(0.5, 0.6, 0.7) # double
x < c(TRUE, FALSE, TRUE) # logical
x < c(T, F, T) # logical (!!!)
x < c("a", "b", "c") # character
x < c(1L, 2L, 3L, 4L) # integer
x < 15:20 # integer
x < c(1+0i, 2+4i, 5+2i) # complex x <
Warning: The example above shows that you could also use T
and F
as an
abbreviation for TRUE
and FALSE
which you may see frequently when you ‘google’
something. Please don’t do that! T
and F
are
not protected and can be changed. Below we redefine or overwrite T
with
"bar"
and F
with "FOO"
. Thus, c(T, T, F, T)
becomes something
completely different. Even worse: Imagine what happens if I redefine
T < FALSE
and F < TRUE
.
c(T, T, F, T)
## [1] TRUE TRUE FALSE TRUE
"foo" # overwrite T
T < "bar" # overwrite F
F <c(T, T, F, T)
## [1] "foo" "foo" "bar" "foo"
Thus, get used to TRUE
and FALSE
from the beginning. T
and F
is shorter, but you might run into strange problems at some point which can be
avoided by just not using these abbreviations.
Function vector()
An alternative way to create a vector of a specific type and length is
the function vector()
. vector()
requires two input arguments, namely
the type of the vector (or ‘mode’) and the length of the vector.
# Double (numeric) vector of length 5
vector("double", length = 5)
## [1] 0 0 0 0 0
# Character vector of length 3
vector("character", length = 3)
## [1] "" "" ""
As you can see both vectors are empty. The default value of numeric vectors is
0
, the default for character vectors is ""
(just an empty character string,
a text without any characters in it), and FALSE
for logical vectors.
Numeric sequences
Regular sequences can be created using the function seq()
and its shortcuts
seq.int()
, seq_along()
, and seq_len()
.
 Sequences: sequences are a set of numeric or integer values between two
well defined points (
from
andto
) with an equidistant spacing. By default the spacing/increment is1L
but can be explicitly specified using theby
argument.  Repeat: Repeat allows to repeat a number (or a set of numbers) in different ways. We can repeat one element several times, repeat a vector multiple times, or repeat the elements of a vector multiple times.
Numeric sequences: We can create numeric sequences using the seq()
(generic function) or seq.int()
(internal/primitive function). You can try
it out yourself, seq()
and seq.int()
would do the very same in the examples
below. Some times it can be beneficial to use seq.int()
as it might be much
faster, but a bit less flexible. To create a sequence, we have to at least
specify three input arguments:
from
: Where to start the sequence.to
: Where to end the sequence.length
orby
: either define the length of the resulting vector, or the increment by which the values change. Note: ifto
is less thanfrom
,by
must be negative (decreasing sequence). Else you’ll get an error.
# Equidistant numeric sequence
seq(from = 1.5, to = 2.5, length.out = 5) # Specify length
## [1] 1.50 1.75 2.00 2.25 2.50
seq(from = 4, to = 4, by = 0.5) # Specify increment/interval
## [1] 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0
## [16] 3.5 4.0
Technically, seq()
can also be used to create integer sequences. However,
this only works if all three arguments from
, to
, and by
(default is
1L
) are given, and all are integers!
# Explicitly define from/to/by as integers
seq(from = 10L, to = 100L, by = 10L)) (x <
## [1] 10 20 30 40 50 60 70 80 90 100
class(x)
## [1] "integer"
# 'from' defined as a numeric value (10.0)
seq(from = 10, to = 100L, by = 10L)) (y <
## [1] 10 20 30 40 50 60 70 80 90 100
class(y)
## [1] "numeric"
Thus, in practice, you should never rely on the class of the sequence returned
by seq()
.
Integer sequences: There are three distinct functions to create proper integer sequences.
<from>:<to>
: Two values separated by a colon (:
); creates sequences in steps of+1
or1
. If<from>
is an integer (or a numeric value without decimals), an integer sequence will be created.seq_along(x)
: Creates a sequence from1L
tolength(x)
wherex
is an existing object (e.g., a vector).seq_len(n)
: creates a sequence between1L
andn
.
Two examples for <from>:<to>
:
# Sequence from 1:4
1:4) (x <
## [1] 1 2 3 4
class(x)
## [1] "integer"
# Sequence from 30:25 (decreasing)
30:25) (y <
## [1] 30 29 28 27 26 25
class(y)
## [1] "integer"
To be able to use seq_along()
we need an object along which we would like
to create the integer sequence. Let us assume we have a character vector cities
(see below),
seq_along()
will return a sequence counting from 1
up to the number of elements in names
.
# Create character vector 'cities'
c("Vienna", "Paris", "Berlin", "Rome", "Bern")
cities <seq_along(cities)
## [1] 1 2 3 4 5
Last but not least, seq_len()
simply creates an integer sequence from 1L
to n
:
seq_len(10)
## [1] 1 2 3 4 5 6 7 8 9 10
Character Sequences: If you need the letters of the alphabet, there are two
convenient vectors which are available globally called LETTERS
and letters
.
LETTERS
contains the alphabet (no special characters) in upper case letters
("A"
, "B"
, …), letters
the same in lower case letters ("a"
, "b"
,
…). This can be used if one just wants to have some random character values.
# First 7 letters of the alphabet
1:7] LETTERS[
## [1] "A" "B" "C" "D" "E" "F" "G"
1:7] letters[
## [1] "a" "b" "c" "d" "e" "f" "g"
This is often a nicetohave when one needs the first few letters of the alphabet or requires random characters in the code and will be used every now and then in this book.
Replicating elements
Another often useful function is replicate. Replicate can be used for all vectors, no matter
if they are numeric, integer, character, or logical. The rep()
function can be
used in different ways:
 Replicate one specific value
n
times.  Given an existing vector: Replicate each element
n
times.  Given an existing vector: Replicate the entire vector
n
times.  Given an existing vector: Replicate the elements different amount of times.
Let us start with the most simple case and replicate the value 2L
five times:
rep(2L, 5) # Case (1)
## [1] 2 2 2 2 2
For all other cases we need an existing vector. Let us use a variant of the character
vector cities
and do both: repeat each element three times, or repeat the entire
vector three times.
c("Vienna", "Bern", "Rome")
cities <rep(cities, each = 3) # Case (2)
## [1] "Vienna" "Vienna" "Vienna" "Bern" "Bern" "Bern" "Rome" "Rome"
## [9] "Rome"
rep(cities, times = 3) # Case (3)
## [1] "Vienna" "Bern" "Rome" "Vienna" "Bern" "Rome" "Vienna" "Bern"
## [9] "Rome"
Alternatively we can specify which element in the original vector (cities
) should
be replicated how often.
rep(cities, times = c(3, 2, 5))
## [1] "Vienna" "Vienna" "Vienna" "Bern" "Bern" "Rome" "Rome" "Rome"
## [9] "Rome" "Rome"
As you can see, the first element of cities
is replicated three times, the second
twice, and the last one five times.
Similar as for seq()
there are some shortcuts, namely rep.int()
and rep_len()
.
rep.int()
is again the primitive (and sometimes much faster) version of rep()
.
rep_len()
can sometimes be very handy. What it does is to:
 Replicate elements of an existing vector until a specific length is reached.
Let us assume we would like to replicate the elements c(4, 5, 6)
as often
as necessary to get a vector of length 5
, or 9
:
rep_len(c(4, 5, 6), length.out = 5) # Resulting vector has length 5
## [1] 4 5 6 4 5
rep_len(c(4, 5, 6), length.out = 9) # Resulting vector has length 9
## [1] 4 5 6 4 5 6 4 5 6
It repeats the elements (always from left to right) until the new vector reaches the length specified as second input argument.
4.4 Coercion (mixing objects)
As mentioned in the first part of this section: Vectors can only contain elements of one type! Thus, if we combine elements of different types (e.g., combine a numeric value and character values as in the next example), R has to convert all elements into the same type/class as vectors can only contain elements of one type.
This is called ‘coercion’. We have to differ between implicit coercion (R chooses the best option) and explicit coercion (we force something to be of a different type).
Numeric and character
What happens if we try to mix elements of different classes, like numeric values and characters as in this example?
# Numeric and character = character
c(1.7, "1", "A")) (x <
## [1] "1.7" "1" "A"
class(x)
## [1] "character"
The object x
is of class "character"
. This is one example of implicit coercion,
and R decided that the best way is to convert all elements into character to
lose as little information as possible.
In this case, 1.7
(numeric) can easily be converted into the character
"1.7"
(character), but we can’t easily convert "1"
and "A"
into
numeric values. While it would work for "1"
(numerically 1.0
) the
character "A"
cannot be converted into a numeric value at all.
Note: The entire object x
is now a character object. This is important
as we can no longer use it for e.g., mathematical operations.
# What is x (character vector) times 5.0 (numeric)?
* 5.0 x
## Error in x * 5: nonnumeric argument to binary operator
Numeric and logical
Similar when mixing logical
and numeric
values. Again, R needs to bring
all values to the same type and tries to lose as little information as
possible. What we need to know here:
 Every numeric value equal to
0
/0L
converted to logical results inFALSE
.  Every numeric value not equal to
0
/0L
converted to logical results inTRUE
.  Every
TRUE
converted to numeric will be1
(or1L
).  Every
FALSE
converted to numeric will be0
(or0L
).
Let us assume we have a logical FALSE
and two numeric values 5.5
and 10.0
we could:
 Option A: Convert
FALSE
into0.0
and keep the other two values as they are.  Option B: Keep
FALSE
as it is and convert both (5.5
,10.0
) toTRUE
.
In the first case we only lose little information, in the second one we completely
lose the original values of 5.5
and 10.0
. Thus, we expect R to convert everything
to numeric (not to logical). Let us try:
# Combine TRUE, 5.5, and 10.0.
c(FALSE, 5.5, 10.0)) (x <
## [1] 0.0 5.5 10.0
class(x)
## [1] "numeric"
Et voilà !
Explicit coercion
As mentioned in the beginning, we can also perform explicit coercion where we as users decide in which type/class the data should be converted.
A range of as.*()
functions allows us to explicitly convert values between
different classes. E.g., as.character(1.42)
will convert the numeric
value 1.42
into a character "1.42"
.
A wide range of different as.*()
functions exist such as:
as.integer()
as.numeric()
as.character()
as.logical()
as.matrix()
 …
Let us create a new integer vector between 0
and 4
, assign it to x
, and
convert this integer vector to character and logical:
# let `x` be an integer vector
# with elements 0, 1, 2, 3, 4.
0:4) (x <
## [1] 0 1 2 3 4
# Coerce to character
as.character(x)
## [1] "0" "1" "2" "3" "4"
# Coerce to logical
as.logical(x)
## [1] FALSE TRUE TRUE TRUE TRUE
As mentioned previously: In the coercion to logical all numeric values 0
will
be converted to FALSE
and all others (including negative values) to TRUE
.
Even if we use explicit coercion, it is sometimes not possible to convert from
one class to another.
Characters from the alphabet, such as "a"
or "b"
cannot
be converted to numerics or integers. However, characters which are basically
only characters containing a numeric value (e.g., "100"
, "135.3"
) can be
coerced.
Missing values
If R is not able to convert elements, it will return NA
(and throw a
warning):
# let `x` be a character vector
c("a", "b", "c", "d")) (x <
## [1] "a" "b" "c" "d"
# Coerce to integer
as.integer(x)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
# But ...
c("1", "100", "a", "b", "33")
x <as.integer(x)
## Warning: NAs introduced by coercion
## [1] 1 100 NA NA 33
Coercion summary
The following table contains a summary of what will happen if you convert
the element x
into one of the four classes “numeric”, “integer”, “logical”, or “character”:
x  as.numeric  as.integer  as.logical  as.character 

2.9  2.9  2L  TRUE  “2.9” 
0  0  0L  FALSE  “0” 
4L  4  4L  TRUE  “4” 
0L  0  0L  FALSE  “0” 
TRUE  1  1L  TRUE  “TRUE” 
FALSE  0  0L  FALSE  “FALSE” 
“A”  NA  NA  NA  “A” 
“TRUE”  NA  NA  TRUE  “TRUE” 
“FALSE”  NA  NA  FALSE  “FALSE” 
NA  NA  NA  NA  “NA” 
Exercise 4.1 In the example above (Numeric and logical) we have
seen that R converts c(FALSE, 5.5, 10.0)
into a numeric vector.
Use the same example and use explicit coercion to convert the vector
into a logical vector (as.logical()
).
 What is the result? Check class.
 Can you explain why the result is as it is?
Solution. We can do this in one step or two steps. Both yield the same result, of course. Twostep approach:
Twoline approach: We first create a new object x
by combining
the three elements. Note: In this case, implicit coercion is used and x
will be a numeric vector (as in the example above). We then convert this
vector into a logical vector by explicitly calling as.logical(x)
.
c(FALSE, 5.5, 10.0)) (x <
## [1] 0.0 5.5 10.0
as.logical(x)) # Overwrite 'x' (x <
## [1] FALSE TRUE TRUE
class(x)
## [1] "logical"
Oneline approach: We could, of course, also define the vector
inside the as.logical()
function call. R does the very same, except
that we do not explicitly store c(FALSE, 5.5, 10.0)
, but directly forward
it to the as.logical()
function call.
as.logical(c(FALSE, 5.5, 10.0))) (x <
## [1] FALSE TRUE TRUE
class(x)
## [1] "logical"
Result: The result is c(FALSE, TRUE, TRUE)
, a vector of class logical.
What happens is the following:
c(FALSE, 5.5, 10.0)
is implicitly converted to the numeric vectorc(0, 5.5, 10.0)
. We then use explicit coercion and convert
c(0, 5.5, 10.0)
into a logical vector.  As we know, each numerical value
0
is treated asFALSE
and all numeric values not equal to0
asTRUE
.
Thus, we get FALSE
for the first element, and TRUE
for the second and third
which gives the result c(FALSE, TRUE, TRUE)
.
4.5 Mathematical operations
Once we have a vector (even vectors of length 1
; single value) we can perform
arithmetic operations on them. A list of available arithmetic operators can be
found on the help page ?Arithmetic
. The following ones are available:
operator  description  example 

+  addition  5 + 5 = 10 
  subtraction  5  5 = 0 
*  multiplication  2 * 8 = 16 
/  division  100 / 10 = 10 
^ (or **)  exponent/power  5^2 = 25 
%%  modulo  100 %% 15 = 10 
%/%  integer division  100 %/% 15 = 6 
While the first five should be well know, the latter two are often useful but might require some additional information.
 Integer division: The result of the integer division
100 %/% 15
is6
because15
fits into100
six times (15 * 6 = 90
).  Modulo: The modulo/mod of
100 %% 15
is10
as we can fit15
six times in100
(6 * 15 = 90
), and the remaining is10
(100  90 = 10
). This can be very handy to check if a number is divisible. E.g.,100 %% 10 = 0
as10 * 10 = 100
, rest0
.
Beside mathematical operators (see table above) a series of relational and
logical operators exist. The table above shows the most important operators
(see also ?Comparison
, ?Logic
and ?match
).
Type  Operator  Condition 

Relational  x < y 
Where x less than y 
x > y 
Where x greater than y 

x <= y 
Where x less than or equal to y 

x >= y 
Where x greater than or equal to y 

x == y 
Where x (exactly) equals y 

x != x 
Where x is not equal to y 

Logical  ! 
Negation (NOT; !x == 20 , same as x != 20 ) 
& 
Logical “and” (x >= 20 & x < 35 ) 

 
Logical “or” (x == 20  x > 45 ) 

xor 
Logical “exclusive or” (xor(x == 20, x == 50) ) 

Value matching  x %in% y 
Where x is in y (value matching; character) 
! x %in% y 
Where x is not in y (value matching; character) 
To briefly demonstrate what relational operators are used for: Let us assume we have
an integer vector x
and we want to check (i) which element is identical to 100L
,
or (ii) which of the elements is less than or equal to 30L
.
c(30L, 100L, 120L, 10L, 30L, 50L)
x <== 100L # Which element in 'x' is equal to 100L x
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
<= 30L # Which element in 'x' is less than or equal to 30L x
## [1] TRUE FALSE FALSE TRUE TRUE FALSE
Logical operators are used to combine multiple relational comparisons. For example,
which elements in x
are less than or equal to 30L
or exactly 100L
.
<= 30L  x == 100L x
## [1] TRUE TRUE FALSE TRUE TRUE FALSE
The result of relational or logical comparisons are always logical vectors and will become very useful to find/extract specific values, or to add logic to a program. We will discuss them in more detail and how to use them later (e.g., in Subsetting vectors, or Conditional Execution).
Vectors and scalars
We can use all the arithmetics on numeric vectors. E.g., multiply a sequence
(1:10
) with a scalar (a single number; e.g., 2
):
# Multiply a sequence by 2
1:10
x <* 2 x
## [1] 2 4 6 8 10 12 14 16 18 20
Note that the operation is applied elementbyelement. Each element of the vector
is multiplied by 2. The same is true for all operators. Another example: given
a sequence of numbers on x
: which ones are divisible by 10?
# Our vector x
c(43, 100, 34, 483, 1000)
x <%% 10 x
## [1] 3 0 4 3 0
All elements where the modulo is 0
(rest 0, see above) are divisible by 10
.
Vectors and Vectors (Matching Length)
We can also use arithmetics between two vectors. Let’s start with two vectors of
the same length. Let x
and y
contain a set of three numeric values which we
would like to add up.
# Our data
c(500, 400, 600)
x < c(10, 5, 100)
y <# Call x + y (addition) and x / y (division)
+ y x
## [1] 510 405 700
/ y x
## [1] 50 80 6
Both vectors (x
and y
) are of the same length. In this case, the operation
is applied on pairs of elements; i.e., the first of x
is added to the first
element of y
, the second element of x
to the second element in y
and so
forth and so on.
The same is true for the division, of course.
Vectors and vectors (nonmatching lengths)
What if the length of the two vectors differ? Well, let’s see. x
and y
again contain some numeric values. However, the length of x
is \(4\) while the
length of y
is only \(2\).
# Our data
c(500, 400, 600, 800)
x < c(100, 2)
y <# Call x + y (addition) and x / y (division)
+ y x
## [1] 600 402 700 802
/ y x
## [1] 5 200 6 400
In these situations, R recycles the shorter vector! Internally, R expands
the vector y
to c(100, 2, 100, 2)
. Thus,
 in case of
x + y
the result isx + c(100, 2, 100, 2)
 in case of
x / y
the result isx / c(100, 2, 100, 2)
This only works well if the length of one vector is a multiple of the length of the other one.
If you try multiplying c(1, 1, 1) * c(5, 2)
, R will calculate c(1, 1, 1) * c(5, 2, 5)
but
will raise a warning message that the longer object length (here \(3\)) is not a multiple
of the shorter object length (here \(2\)).
Logical values
Arithmetic operations can also be performed on logical elements. As we have
just learned TRUE
will always be converted to 1L
, FALSE
always to 0L
.
Thus, if we multiply c(TRUE, FALSE)
by a factor of 100L
R will return
the result of c(1L, 0L) * 100L
:
c(TRUE, FALSE) * 100L
## [1] 100 0
c(TRUE, FALSE) / 2
## [1] 0.5 0.0
This is not something one will use often, but can sometimes be useful.
4.6 Vector attributes
All objects in R can have additional so called attributes. Attributes are used to store additional metainformation. They are not part of the data itself, but part of the object, and are used for building more complex objects on top of vectors. Examples for such ‘more complex’ objects are matrices, data frames, factors, or date/time objects as we will see in the upcoming chapters.
Typical examples of attributes are:
 The class of the object.
 The dimension of matrices or arrays.
 Names of elements (vectors) or dimensions (matrices, arrays).
While some of the attributes are mandatory, others are optional. As we have
seen previously, the function class()
returns the class of the object.
The class is a mandatory attribute and every object will always have a class
and a length.
Beside the class, additional attributes can exist. To see if an object has
such attributes we can use the function attributes()
which extracts all
attributes (except class).
Let us first define a plain atomic vector without any additional attributes.
# Plain vector
c(54, 1.82)) (x <
## [1] 54.00 1.82
attributes(x)
## NULL
As shown, attributes(x)
returns NULL
. NULL
is a reserved word for the
null object in R, an empty object. NULL
basically means “nothing or no
information”. Our plain vector x
does not have any additional attributes.
Named vectors
One additional attribute is names. Names can be used to name individual elements of a vector. To demonstrate what this implies, let us create a numeric vector with two named elements.
# Adding optional names
c(age = 54, height = 1.82)) (y <
## age height
## 54.00 1.82
The output looks different now.
Both elements have an additional name (a character) which is linked to the
corresponding numeric value. Calling attributes()
now returns the information
that the object has an additional names attribute.
attributes(y)
## $names
## [1] "age" "height"
Technically, names are simply an optional attribute of an R object. However,
it is one of the ‘standard’ attributes for which a special function called
names()
exist. The function names()
can be used for both:
 To get the names of an object.
 To set or overwrite the names of an object.
Named vectors can often be useful as:
 They are selfexplanatory (each element is named).
 We can access specific elements by name (shown later).
As shown earlier, we can explicitly declare the names when creating a new vector. Let us use the very same:
# Create a named vector
c(age = 54, height = 1.82)) (x <
## age height
## 54.00 1.82
If we now call names(x)
we
will get a character vector with the names of the elements:
# Get/extract the names
names(x)
## [1] "age" "height"
names()
can not only be used to get the names, it can also be used to
set or overwrite the names of a vector. An example:
# Plain vector (without names)
c(1L, 2L, 3L)) (x <
## [1] 1 2 3
names(x)
## NULL
# Add names
names(x) < c("first", "second", "third")
names(x)
## [1] "first" "second" "third"
x
## first second third
## 1 2 3
If we assign values to names(x)
(with the sets operator <
) R will set
names or overwrite existing names of x
. Note: the vector of names on the
right hand side has to contain the same number of elements as the vector on the
left hand side.
In case a vector already has names and we set new ones, the existing ones will simply be overwritten.
# Change the names of 'x'
names(x) < c("1st", "2nd", "3rd")
x
## 1st 2nd 3rd
## 1 2 3
It is also possible to only assign names to some of the elements, or give multiple elements the same name. However, this is not recommended. An example:
c(200, age = 54, height = 1.82, height = 182, 200, 300)
x <names(x)
## [1] "" "age" "height" "height" "" ""
As shown, R gives all unnamed elements the name ""
(an empty character string).
Thus, we now have three elements in the vector called ""
and two called "height"
which can quickly result in confusion or problems.
4.7 Subsetting vectors
Creating vectors is one part of the game, the second part is to be able to get information from a vector. Accessing specific elements of a vector (or an object in general) is called subsetting. Vectors can be subsetted in different ways:
 By index (position in the vector).
 Based on logical vectors.
 By name (if set).
Note: It is very important to understand this concept as it works similar/the same for all R objects.
Subsetting by index
As we have learned previously, function calls use round brackets (e.g., class()
, length()
).
Subsetting uses squared brackets ([...]
or [[...]]
).
The simplest way of subsetting a vector is subsetting by index.
An index is simply the position of a specific element in a vector.
R always starts counting at 1 and we can access the first element
by calling x[1]
(“give me the 1st element of x
”), or the fifth
element by using x[5]
.
# Create a demo vector of length 5
c(10, 20, 0, 30, 50)) (x <
## [1] 10 20 0 30 50
1] # Extract first element by index x[
## [1] 10
5] # Fifth element x[
## [1] 50
If we need multiple elements at the same time, e.g., the first and the fifth, we can use a vector of indices as follows:
c(1, 5)] # Get the first and fifth x[
## [1] 10 50
c(5, 1)] # Or the fifth and first (different order) x[
## [1] 50 10
Similarly, we can use negative indexes to get all values except some specific ones.
E.g., x[1]
gives “all but the first element of x
”.
1] # all but first x[
## [1] 20 0 30 50
5] # all but fifth x[
## [1] 10 20 0 30
c(1, 5)] # all except first and fifth x[
## [1] 20 0 30
Outofrange indexes: What if we try to access elements outside the vector?
Our vector (x < c(10, 20, 0, 30, 50)
) only contains 5 elements.
Let us subset the elements 4:7
, or element 100
:
4:7] x[
## [1] 30 50 NA NA
100] x[
## [1] NA
R returns a missing value (NA
) for all elements which are not defined
inside the vector and does not run into an error.
Last/first few elements: Let us imagine we are only interested in
the first three or last three elements of the same vector x
.
We can do this using ‘subsetting by indices’ as follows:
1:3] # First three elements (element 1  3) x[
## [1] 10 20 0
3:5] # Last three elements (elements 3  5) x[
## [1] 0 30 50
There are convenience functions for that:
head(x, n = 3)
extracts the firstn
elements (by defaultn = 6
).tail(x, n = 3)
extracts the lastn
elements (by defaultn = 6
).
head(x, n = 3)
## [1] 10 20 0
tail(x, n = 3)
## [1] 0 30 50
If the vector is shorter than n
, R will return the entire vector (try
head(x, n = 10)
or tail(x)
).
The two functions head()
and tail()
are generic functions and will also
work for most more complex objects.
Subsetting by name
If we have named vectors we can make use of the names of the elements to access them. Let us use the following named vector:
c(age = 35, height = 1.72, zipcode = 6020) x <
We are interested in the value for the element "age"
. Instead of using
x[1L]
we can now use x["age"]
for subsetting:
"age"] x[
## age
## 35
Note: The argument inside the brackets has to be a character string.
x[age]
does something different  in this case R expects that there
is an object called age
. If not existing, you will get an error message:
x[age]
## Error in eval(expr, envir, enclos): object 'age' not found
As for subsetting by indexes, we can also ask for multiple elements
at the same time. However, negative subsetting (x["age"]
) does not work.
c("age", "zipcode")] # get both, "age" and "zipcode" x[
## age zipcode
## 35 6020
"age"] # Does not work, results in an error x[
## Error in "age": invalid argument to unary operator
Advantage of subsetting by name: We do not have to care about the position
of the element in the vector. Imagine the following: we have two objects called
person1
and person2
with the same information as we have had above.
# Create the two named vectors
c(age = 49, height = 1.84, zipcode = 5001)
person1 < c(zipcode = 5040, age = 13, height = 1.52) person2 <
Imagine we ASSUME that the age is always the first element and use subsetting by index (pick the first element of both vectors):
cat("Person 1 is ", person1[1L], "years old.\n")
## Person 1 is 49 years old.
cat("Person 2 is ", person2[1L], "years old.\n")
## Person 2 is 5040 years old.
You can see the problem. Named vectors help one to avoid this issue as we do not have to deal with the position in the vector but just the name. Let’s do the same but use subsetting by name:
cat("Person 1 is ", person1["age"], "years old.\n")
## Person 1 is 49 years old.
cat("Person 2 is ", person2["age"], "years old.\n")
## Person 2 is 13 years old.
The function cat()
(concatenate and print) can be used to create
output to the R console. It can take up multiple elements separated
by commas, combine them into a character string and print the result.
The \n
at the end indicates a line break, such that (after each print)
a new line is started.
cat()
is used to create nicely formatted output (see ?cat
for details).
Subsetting by logical vectors
Alternatively, we can use a logical vector to subset vectors. Remember:
logical vectors contain either TRUE
or FALSE
. In the context of
subsetting it can be used to get all elements where the logical vector contains
TRUE
. An example:
c(30, 10, 20, 0, 30, 50)
x <c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)] x[
## [1] 30 10
As you can see we only get the first two elements, as only the first two
elements of the logical vector contain TRUE
. The two vectors (vector x
to be
subsetted and the logical vector) should be of the same length, else the logical
vector will be recycled.
This is not often used as explicitly as shown above where we actually define
a logical vector (c(TRUE, TRUE, ...)
). Logical vectors are the result of binary
comparisons.
As we have already seen in subchapter Mathematical operations
R comes with a series of relational and logical operators. They now become very handy
to subset vectors. Given the vector above, let us find all elements in x
which are
larger than 25
.
> 25 # Shows the result of the relational comparison x
## [1] TRUE FALSE FALSE FALSE TRUE TRUE
> 25] # Use 'x > 25' (logical vector) for subsetting x[x
## [1] 30 30 50
Some more examples which you can try yourself given our vector
x < c(30, 10, 20, 0, 30, 50)
:
x[x == 30]
: Elements inx
wherex
is (exactly) equal to 30x[x == 30  x == 50]
: Elements wherex
is either equal to 30 OR equal to 50.x[x == 30 & x == 50]
: Elements wherex
is equal to 30 AND 50 at the same time. The result should be an empty vector as this is not possible.x[x >= 30 & x < 40]
: All elements wherex
is larger than or equal to 30, but less than 40.x[x < 30 & x > 40]
: All elements wherex
is less than 30 AND larger than 40 at the same time. Again, impossible, the result should be an empty vector.
Such expressions can also be more complex and/or use data over different
vectors. We can also store the logical vector onto a new object and use this
object for subsetting. As an example, imagine we have two numeric vectors age
and height
with the age and height of some people.
c( 25, 53, 21, 50, 18, 63)
age < c(1.73, 1.69, 1.83, 1.57, 1.91, 1.75) height <
We would like to get the age of those smaller than 1.8
.
In this case we first perform the relational comparison on the vector height
.
The result (a logical vector) is stored onto a new object res
.
In a second step we then use res
to subset the age
vector.
height < 1.8
res < res
## [1] TRUE TRUE FALSE TRUE FALSE TRUE
age[res]
## [1] 25 53 50 63
We can, of course, do this in one line:
< 1.8] age[height
## [1] 25 53 50 63
Or extend the expression to get the age for those who are smaller than 1.8
and
older than 50
years.
> 50 & height < 1.8] age[age
## [1] 53 63
Warning: Avoid using ==
for floating point arithmetic on double vectors.
This can lead to unexpected results due to small imprecisions. An example:
1.5  0.5) == 1 (
## [1] TRUE
1.9  0.9) == 1 (
## [1] FALSE
Alternative: Better use the function all.equal()
. The function checks
if numeric values are equal (but not exactly equal) by allowing for a (very small)
tolerance. For those interested what the tolerance is, check ?all.equal
.
all.equal(1.9  0.9, 1)
## [1] TRUE
Further functions
Besides the relational and logical operators shown above, some more exist.
&&
,
: Like&
(and) and
(or) but evaluate only first elements until results are determined.isTRUE()
,isFALSE()
: Test for single nonmissingTRUE
orFALSE
.
Take care: there is a distinct difference between &
/
and the longer form
&&
/
. While &
and 
can be used on vectors (elementwise logical comparison),
the longer form evaluates left to right examining only the first element of
each vector. The evaluation proceeds only until the result is determined.
While the long form can be very handy in some situations (e.g., conditional execution)
we will mainly use the short form (&
/
) in this course.
Index of TRUE
elements
Often one is interested in where elements are stored which match an expression.
Therefore we can use the function which()
. which()
returns the indices
of the elements where the expression evaluated to TRUE
.
c(30, 10, 20, 0, 30, 50)
x <>= 30 x
## [1] TRUE FALSE FALSE FALSE TRUE TRUE
which(x >= 30)
## [1] 1 5 6
The expression x >= 30
is TRUE
for the elements at position or index 1, 5, 6,
which is exactly what which()
returns.
Example: Find the position of the minimum and the maximum of the vector x
.
min(x) # The minimum
## [1] 0
== min(x) # Comparison x
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
which(x == min(x)) # Index/position
## [1] 4
The same can be done to find the maximum.
max(x) # The maximum
## [1] 50
== max(x) # Comparison x
## [1] FALSE FALSE FALSE FALSE FALSE TRUE
which(x == max(x)) # Index/position
## [1] 6
In addition, there are two functions which.min()
and which.max()
which are more
reliable to find the minimum/maximum of floating point numbers. However, they only
return the index of the very first occurrence of the minimum/maximum! Imagine the
following vector z
. The minimum is 0.5
which occurs twice, position 2 and 5.
Now compare the different result of which(z == min(z))
and which.min(z)
:
c(10, 0.5, 20, 30, 0.5)
z <which(z == min(z)) # Position/index of all minima
## [1] 2 5
which.min(z) # Position/index of first occurrence of minimum
## [1] 2
The same is valid for which.max()
.
Single and double brackets
In the beginning of this chapter it was mentioned that we can use single
brackets ([]
) or double brackets ([[]]
) for subsetting. While we will most
often use single brackets, let us see what they do different.
[]
keeps the name attribute, while [[]]
drops them all. If we have an
unnamed vector there is no difference between the two:
c(54, 1.82, 6020)) (x <
## [1] 54.00 1.82 6020.00
3] # Single brackets x[
## [1] 6020
3]] # Double brackets x[[
## [1] 6020
However, if we have a named vector the result of the two commands look different:
c("age" = 54, "height" = 1.82, "zipcode" = 6020)) (x <
## age height zipcode
## 54.00 1.82 6020.00
3] x[
## zipcode
## 6020
3]] x[[
## [1] 6020
The same is true for subsetting by name or logical vectors.
4.8 Vector functions
There are several functions which can be used to get information from vectors, manipulate vectors, or inspect vectors. This section highlights some useful ones, including:
Function  Description 

round() 
Round numeric values 
min() , max() 
Minimum and maximum 
mean() , median() 
Arithmetic mean and median 
sum() 
Sum 
sd() , var() 
Standard deviation and variance 
sqrt() 
Square root 
summary() 
Numerical summary 
str() 
Overview of the object structure 
any() , all() 
Test vector elements 
all.equal() 
Test for near equality 
sort() 
Sort a vector 
order() 
Obtain ordering of a vector 
To demonstrate these functions, we will use pseudorandom numbers from
a standard normal distribution which can be drawn using the function
rnorm()
. As random numbers are random (by definition) we would all
get completely different numbers.
For the purpose of reproducibility we will call set.seed()
to
fix a random seed. Thus, we should all get the same random numbers.
As they are no longer completely random, one calls them “pseudorandom”.
set.seed(6020) # A specific seed; fixed using an integer
rnorm(100) # Draw 100 random numbers from the standard normal distribution
x <head(x, n = 10) # Show first 10 elements
## [1] 0.3421647 0.4528670 0.6169125 0.6121245 1.2603337 0.4266803
## [7] 1.7461032 0.8736939 1.0627069 0.3600695
On your computer you should get the very same numbers as shown here (due to the seed we set).
Exercise 4.2 Understanding set.seed()
and pseudorandomization.
 Set a random seed using
set.seed()
. Chose any integer number you like (e.g.,set.seed(1)
).  Draw 5 random values from the standard normal distribution using
rnorm(5)
and store the result onr1
.  Set the very same seed again (as in step 1).
 Draw another 5 random numbers (
rnorm(5)
) and store there result onr2
.  Without setting the seed again, draw 5 more random numbers and store it on
r3
.
r1
, r2
, and r3
. What do you observe?
Solution. r1
and r2
should contain the very same numbers due to the seed we set.
The numbers are no longer purely random, but depend on the seed. This helps to
make things reproducible as we always obtain the same numbers.
r3
however should have different values. As we have not set the seed again,
we now get a different set of values.
# (1) Set a seed (using seed number 1 here).
set.seed(1)
# (2) Draw 5 random values.
rnorm(5)
r1 <# (3) Set the very same seed again.
set.seed(1)
# (4) And draw another 5 random values.
rnorm(5)
r2 <# (5) Draw 5 more random values without setting the seed.
rnorm(5)
r3 <# Compare results
r1
## [1] 0.6264538 0.1836433 0.8356286 1.5952808 0.3295078
r2
## [1] 0.6264538 0.1836433 0.8356286 1.5952808 0.3295078
r3
## [1] 0.8204684 0.4874291 0.7383247 0.5757814 0.3053884
We now use vector x
from above to demonstrate what the functions above do.
Let us start with round()
. By default, round()
rounds to 0
digits
(closest natural number) which can be respecified using the additional
argument digits
.
head(round(x, digits = 1), n = 10) # Round, show first 10 elements only
## [1] 0.3 0.5 0.6 0.6 1.3 0.4 1.7 0.9 1.1 0.4
We can also obtain/calculate the minimum, maximum, arithmetic mean, median, variance, standard deviation, the sum over all and the number of elements:
c(min = min(x), max = max(x),
mean = mean(x), median = median(x),
var = var(x), sd = sd(x),
sum = sum(x), length = length(x))
## min max mean median var
## 2.342863490 2.654889032 0.004146298 0.074796745 0.905993511
## sd sum length
## 0.951836914 0.414629842 100.000000000
Exercise 4.3 Using a bit of basic math we can show that:
 The arithmetic mean is nothing else than the sum over all elements divided by the number of elements: \(\bar{x} = \frac{1}{n} \sum_{i=1}^N x_i\).
 The standard deviation is the square root (
sqrt()
) of the variance \(\text{SD}(x) = \sqrt{\text{VAR}(x)}\).  The variance is the standard deviation squared: \(\text{VAR}(x) = \text{SD}(x)^2\).
Try to calculate the arithmetic mean using sum()
and
length()
and compare the result to the result of mean()
. Calculate the
variance using sd()
, and calculate the standard deviation using var()
.
Compare the results to the result of sd()
and var()
; all based on our vector x
.
Solution. Arithmetic mean: To get the arithmetic mean we only have to sum up all elements
of the vector x
and divide it by the number of elements (the length of the vector):
mean(x)
## [1] 0.004146298
sum(x) / length(x)
## [1] 0.004146298
We can use the function all.equal()
to compare if two numeric values
are (nearly) equal and that our result is correct:
all.equal(mean(x), sum(x) / length(x))
## [1] TRUE
Variance: We can use sd()
to calculate the standard deviation and
take it to the power of 2 (^2
), or simply multiply sd() * sd()
to get the same
result.
all.equal(var(x), sd(x)^2)
## [1] TRUE
all.equal(var(x), sd(x) * sd(x))
## [1] TRUE
Standard deviation: As the variance is the standard deviation squared, the
standard deviation must be the square root of the variance. The square root
can be calculated using sqrt()
:
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE
Two more very important generic functions are str()
and summary()
.
They are very helpful to get a first impression of the structure of an
R object (str()
) and its data (summary()
).
str(x) # Structure
## num [1:100] 0.342 0.453 0.617 0.612 1.26 ...
str(x)
tells us that our object x
is a numeric (num
) vector
with elements \(1\) to \(100\) (1:100
), the rest of the output shows
us the first few elements of x
.
summary(x) # Numerical summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.342863 0.590057 0.074797 0.004146 0.648840 2.654889
The summary(x)
function returns a numerical summary of the object. For numeric values,
the summary consists of the minimum, 1st quartile, median, the mean, 3rd quartile,
and the maximum of all elements in x
. If the vector contains missing values (NA
)
the summary statistics would also tell us how many NA
s we have in the vector.
Missing values
So far, our vector x
contains no missing values. What if we have missing
values (NA
) in a vector? For demonstration, we create the very same vector
again but add 5 missing values at position 2, 4, 50, 70, and 99.
# Generate the vector (again).
set.seed(6020)
rnorm(100)
x <# Replace element 2, 4, 50, 70, and 99 with a missing value (NA)
c(2, 4, 50, 70, 99)] < NA x[
If we now check the summary statistics we can see that it reports the number of missing values:
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.342863 0.573185 0.053709 0.003272 0.719625 2.654889 5
Let us check if the number of NA
s reported by summary()
is correct.
Therefore we use is.na()
which returns a logical vector with TRUE
if
a value is a missing value and FALSE
else. As every TRUE
counts as
1
and ever FALSE
as 0
(numerically) we can simply sum up this vector.
# Count missing values
sum(is.na(x))
## [1] 5
We have 5 TRUE
s and 95 FALSE
s, thus the sum is 5
and (of course) the same
as summary()
returns.
Mathematical operations: as mentioned earlier all mathematical operations
return NA
as soon as at least one missing value is involved in the calculation.
This is true for all functions shown below:
c(min = min(x), max = max(x),
mean = mean(x), median = median(x),
var = var(x), sd = sd(x),
sum = sum(x))
## min max mean median var sd sum
## NA NA NA NA NA NA NA
How to handle missing values? There are basically two options on how we can handle this situation.
 Think about where they come from. Is it a problem? Is it realistic to have missing values, do I understand why they are there? Can I ignore it or is the data set invalid and I have to go back ‘to the lab’?
 Simply remove (delete) all
NA
s and ignore it.
The latter one is not the way to go! Whenever you encounter missing values in a data set: think about it. There are many situations where missing values are part of a data set and it is OK to have them included. All we have to do is to properly account for them. However, sometimes missing values indicate that your data set is invalid and you should not proceed as you may end up with false conclusions or results!
Once you know why it happens and that it is OK to use the data set we have two options:
 Remove all missing values. This can be done using the
na.omit()
function.na.omit()
simply removes missing values from the vector (you’ll end up with a shorter vector).  Keep the missing values and account for them using the
na.rm
option which is available for a wide range of functions.
Omit missing values: Omit all missing values in x
, store the new vector
(without missing values) on x2
and check it’s length.
na.omit(x)
x2 <c(length(x), length(x2))
## [1] 100 95
As you can see, we dropped the five missing values from x
. We can now
use x2
to do our calculations (e.g., calculate the mean or the variance).
Alternatively, we keep the missing values and account for them using the
additional input argument na.rm = TRUE
.
c(mean(x, na.rm = FALSE), # Default
mean(x, na.rm = TRUE)) # Account for missing values
## [1] NA 0.003271709
This option exists for a wide range of functions:
# Minimum, maximum, mean, median, variance, and standard deviation
c(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE),
mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
var = var(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
sum = sum(x, na.rm = TRUE))
## min max mean median var sd
## 2.342863490 2.654889032 0.003271709 0.053709304 0.936059399 0.967501627
## sum
## 0.310812328
all()
and any()
These two functions check if either all elements match a certain condition (all), or at least one (any), based on logical vectors.
all()
simply checks if all elements are TRUE
. If so,
a logical TRUE
will be returned, otherwise FALSE
is returned.
Some examples:
# Are all elements TRUE?
all(c(TRUE, TRUE, TRUE))
## [1] TRUE
all(c(0L, 0L, 0L) == 0L)
## [1] TRUE
all(c(1, 2, 3, 4) > 2)
## [1] FALSE
any()
does basically the same but returns TRUE
as soon
as at least one element is equal to TRUE
.
# Is any element TRUE?
any(c(TRUE, FALSE, FALSE))
## [1] TRUE
any(c(0L, 0L, 0L) == 0L)
## [1] TRUE
any(c(5, 3, 1) > 0)
## [1] FALSE
This can be very handy if we would like to check if we for example have
any missing value in a vector. The function is.na()
checks if a value
is NA
. This in combination with any()
tells us if we have missing values.
c(10, 30, 50, NA, NA, 30)
x < c(10, 30, 50, 0, 50, 10)
y <any(is.na(x))
## [1] TRUE
any(is.na(y))
## [1] FALSE
4.9 Plotting vectors
Well, one key feature of R is it’s ability to graphically display data (plotting). Visualizing data in some sort or form can often be very helpful to better understand the data and/or identify possible problems such as outliers or unrealistic values. We (as humans) are very efficient in processing visual information, classify and cluster data, or identify certain patterns or structures on a plot  something which is nearly impossible by looking at dozen’s or hundreds of numeric values.
Base R comes with a range of plot types and plotting functions. In this chapter we will go trough the very basics of plotting vectors. Some more information about plotting can be found in the chapter Plotting in this book.
There are several different basic plot types, including:
Function  Description 

plot() 
Generic XY plot 
barplot() 
Bar plot 
pie() 
Pie chart 
hist() 
Histogram 
boxplot() 
BoxandWhisker plot 
These are all generic functions and will react differently for different objects. In this section we will focus on named and unnamed (plain) atomic vectors.
Generic XY Plot (Vector)
Creates a twodimensional scatter plot. If only one input is given (the ‘x’) the values will be plotted against the yaxis, while the xaxis shows the index or position of the element in the vector. A simple example:
# Create the data vector
c(10, 5, 30, 20, 5, 11)
d <plot(d)
R automatically labels the axis according to what is shown  and the name
of the object (here d
). There are several options to enhance the plot such
as setting a title, rename the axis, or adjusting the way the data are shown.
# Create the data vector
c(10, 5, 30, 20, 5, 11)
d <plot(d,
main = "Figure Title",
xlab = "index/position in the vector",
ylab = "measurement",
type = "l", col = 4, lwd = 3)
For more details check out the manual page ?plot
and/or the Plotting chapter.
We can also use it to plot two vectors (same length) against each other. The example below
visualizes the sine function between \(\pi\) and \(+\pi\).
# Sequence from pi to pi (used for xaxis)
seq(pi, pi, length.out = 501)
x <# Sine of 'x' evaluated for each element in 'x' (used for yaxis)
sin(x)
sinex <plot(x, sinex, type = "l")
We can also plot functions in R using plot(<function name>)
. The function
is then evaluated along the xaxis. Only works for functions taking up one
value x
as first input argument. By default, the function is evaluated
between 0 and 1, however, we can specify this using custom xlimits.
plot(sin, xlim = c(pi, pi), main = "Sine function", type = "l")
Other functions might be cos()
, sqrt()
, log()
, … Feel free to try it yourself!
Bar plot
Bar plots are typically used to visualize data for specific groups or classes. Let us assume we would like to visualize the outcome of the latest Austrian legislative election from 2019 (Nationalratswahlen; source wikipedia) and plot the number of seats the parties got out of the election.
This is a classical example to use a named vector in combination with barplot()
.
c("ÖVP" = 71,
election_results <"SPÖ" = 40,
"FPÖ" = 30,
"Grüne" = 26,
"NEOS" = 15,
"Fraktionslos" = 1)
barplot(election_results)
Again, we can modify the look and feel to our wishes:
barplot(election_results,
main = "Resultat Nationalratswahlen 2019",
xlab = "Kurzbezeichnung Partei",
ylab = "Sitze",
col = c("turquoise3", "firebrick2", "dodgerblue3", "green4", "violetred2", "gray70"))
Pie plot
An alternative plot type are pie plots. Keep in mind that pie plots are often not the most effective way of visualizing data. However, if needed, we can do it as follows:
c("Sky" = 70,
pyramid <"Sunny side\nof pyramid" = 20,
"Dark side\nof pyramid" = 10)
par(mar = rep(1, 4L)) # reduce border margin
pie(pyramid, init.angle = 35, col = c("#0099FF", "#FFE452", "#AA9526"),
main = "Pie")
Again, if we have a named vector, R automatically takes the names
to label the different sections of the pie (see ?pie
for more customization options).
Histograms
A way to visualize large amounts of numeric values are histograms.
Histograms show the distribution of the data. For demonstration, let us draw
1000 random values from a standard normal distribution (rnorm(1000)
).
We can now see how they are distributed (should show a normal distribution centered around 0 with a variance/standard deviation of 1):
rnorm(1000)
d <hist(d)
Another example: Mixing random values by drawing 1000 out of the standard normal distribution (with mean 0 and standard deviation 1) and 500 from a normal distribution with mean 7 and standard deviation 2:
c(rnorm(1000, mean = 0, sd = 1),
d <rnorm(500, mean = 7, sd = 2))
hist(d,
freq = FALSE, # Show density instead of frequency
breaks = 50) # Number of breaks
BoxandWhisker plot
An alternative way to show the distribution of numeric values are BoxandWhisker plots.
(5 + rnorm(100))^5 # Generate some values
x <boxplot(x)
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 132.5 1668.0 3366.4 4240.6 5753.1 26284.2
The box (gray box) shows the 1st, 2nd (median) and 3rd quartile,
the same as returned by summary()
. The Whiskers (lines) show the data which lie
within 1.5 times the inner quartile range from the box, while outliers (those being
further away) are plotted as dots. The full range of the plot also shows the
minimum/maximum of the data.
BoxandWhisker plots are especially useful to compare data from different groups, something shown at a later time.
Interested in additional learning material about vector algebra? Or even participating in the course “198812 VU Computer Programming Prerequisites” by the DiSC? Feel free to have a look at these tutorials with vector algebra exercises and how to do it in R. This is optional material and not part of the R programming course!