Pretty much everything in R is either an object or a function (or a package that contains these). Objects hold data, and functions generally perform some kind of operation, or transformation, on data.
R contains a lot of built-in functions that usually have the form function.name(). Some examples are listed here: Built-in Functions.
# some built-in functions
sum(2,3)
## [1] 5
log(10)
## [1] 2.302585
Some simple arithmetic and logical functions are so common that they have special symbols and are referred to as “operators”: R Operators
# arithmetic
377/120
## [1] 3.141667
# logical
"this" == 42
## [1] FALSE
223/71 < pi
## [1] TRUE
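To make the link between operators and functions concrete, here is a small sketch. It relies on the fact that any operator can be called like a regular function when its symbol is wrapped in backticks; the variable names are made up for illustration.

```r
# Operators are ordinary functions; backticks let us call them by name.
sum_via_operator <- 2 + 3
sum_via_function <- `+`(2, 3)
identical(sum_via_operator, sum_via_function)

# The comparison from above, written as a function call.
`<`(223/71, pi)

# Floating-point results are rarely exactly equal, so == can surprise you.
exact_match <- (0.1 + 0.2 == 0.3)
exact_match

# all.equal() compares numbers with a small tolerance instead.
close_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))
close_match
```

When comparing the results of calculations, a tolerance-based check like the last one is usually the safer choice.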
R can store several different types of data, including numbers, which can also be written in scientific notation (1e2 or 1e+2 equal 100).
These data types can be stored in different kinds of containers, or “objects”. Different objects are structured in different ways and have different limitations on how they can be used.
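As a quick check, the class() function reports what kind of data a value holds. This is only a sketch covering a few common types, not a complete list.

```r
# class() reports the type of data stored in a value.
class(100)       # "numeric"
class(2L)        # "integer" (the L suffix requests an integer)
class("hello")   # "character"
class(TRUE)      # "logical"

# Scientific notation is just another way of writing a number.
same_as_hundred <- (1e2 == 100)
same_notation   <- (1e+2 == 1e2)
same_as_hundred
same_notation
```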
R also has a small number of built-in constants:
pi
## [1] 3.141593
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
month.abb
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
Most R functions expect a particular type of object, and will sometimes coerce similar objects (e.g. matrix, data frame) into the expected type if possible.
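Here is a small sketch of this kind of coercion, using a made-up data frame called df. The t() function (transpose) works on matrices, so a data frame passed to it is converted to a matrix first.

```r
# A small data frame with two numeric columns (illustrative values).
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

# Explicit conversion to a matrix.
m <- as.matrix(df)
class(m)

# Implicit conversion: t() returns a matrix, not a data frame.
t_df <- t(df)
class(t_df)
dim(t_df)
```

When in doubt, converting explicitly (e.g. with as.matrix()) makes it obvious to the reader what the function will receive.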
There are 5 basic classes of objects that we will come across most often, each described in turn below: vectors, factors, matrices, data frames, and lists.
Objects are created using an arrow (<- or ->) or an equals sign (=).
Left-arrows are technically preferred, but many programmers are so used to using = that this symbol is also a valid assignment operator.
The right-arrow works, but is discouraged because it creates confusion and has no counterpart for the = assignment operator (writing the value on the left of = is an error). Always assign your variables with the object name on the left!
# a character object
a <- "Hello"
a
## [1] "Hello"
class(a)
## [1] "character"
# right-arrow assignment works, but don't do it!
6*7 -> the.answer.to.the.universe.and.everything
the.answer.to.the.universe.and.everything
## [1] 42
class(the.answer.to.the.universe.and.everything)
## [1] "numeric"
# assignment using the equals sign
b = "world"
b
## [1] "world"
# oops! don't try this, ever.
2 = c
## Error in 2 = c: invalid (do_set) left-hand side to assignment
c
## function (...) .Primitive("c")
Here are a few tips for naming objects and functions (when you get around to making your own):
Use meaningful names to help you remember what they are for (e.g. avoid names like a, b, etc.). Valid characters include letters, numbers, ., and _ (but a name cannot start with a number or an underscore). Pick a convention and stick to it.
R is case sensitive! My_Favorite_Things is not the same as my_favorite_things.
CamelCase does not seem to be commonly used in R packages, functions, or object names. This is probably because using a separator provides a little more visual clarity.
Some words are reserved and cannot be used as object names (e.g. TRUE, FALSE, NA, function, etc.).
Avoid giving an object the same name as a function, since this quickly becomes confusing.
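# never do this! (the object now shares its name with the function)
# note: mean(2,3) treats 3 as the trim argument, so it returns 2; the mean of 2 and 3 is mean(c(2,3))
mean = mean(2,3)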
mean
## [1] 2
# a data frame
df.my_cool_data = data.frame(101:103)
df.my_cool_data
## X101.103
## 1 101
## 2 102
## 3 103
class(df.my_cool_data)
## [1] "data.frame"
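The naming tips above can be checked with a short sketch. It uses the base function make.names(), which shows how R would repair names that are not valid as written; the candidate names are invented for illustration.

```r
# make.names() shows how R would repair names that are not syntactically valid.
candidate_names <- c("my_cool_data", "my cool data", "2nd_try", "TRUE")
repaired_names <- make.names(candidate_names)
repaired_names

# A name is valid as written only if make.names() leaves it unchanged.
candidate_names == repaired_names

# Case matters: these are two distinct objects.
My_Favorite_Things <- "raindrops"
my_favorite_things <- "whiskers"
identical(My_Favorite_Things, my_favorite_things)
```

Only the first candidate survives unchanged: spaces are replaced, a leading number gets a prefix, and a reserved word gets a trailing dot.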
A vector is the most basic type of data structure in R. Actually, pretty much everything in R is made up of vectors! We will see this as we go through each type of object. Vectors have the following properties:
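Vectors are created using the c() function.
Elements are accessed by index position using square brackets, e.g. [2], [1:4].
The number of elements is obtained using the length() function.
A figure in R Programming for Research illustrates the different types of vectors.
# - a scalar is just a vector of length 1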
my.scalar = exp(1) # Euler's number
my.scalar
## [1] 2.718282
my.scalar[1] # vector notation (see below)
## [1] 2.718282
vec.num = c(2,5,3,8,9)
vec.num
## [1] 2 5 3 8 9
is(vec.num) # what is this data structure?
## [1] "numeric" "vector"
# subsetting
vec.num[4]
## [1] 8
vec.num[5:6] # what happened? there is no element at index position 6
## [1] 9 NA
vec.num[c(1:3,5)] # get values at index positions 1-3 and 5
## [1] 2 5 3 9
length(vec.num)
## [1] 5
# reassignment
vec.num[4] = 10
vec.num
## [1] 2 5 3 10 9
# adding elements
vec.num[10] = 42
vec.num # interesting! what happened here?
## [1] 2 5 3 10 9 NA NA NA NA 42
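Assigning to an index beyond the end of a vector pads the gap with NA, as seen above. As a rough sketch of some alternatives (all variable names here are made up), c() and append() extend a vector without leaving gaps, and negative indices drop elements.

```r
vec <- c(2, 5, 3, 8, 9)

# Arithmetic is applied element by element.
vec_doubled <- vec * 2
vec_doubled

# c() and append() add elements without leaving NA gaps.
vec_longer   <- c(vec, 42)                     # add to the end
vec_inserted <- append(vec, 100, after = 2)    # insert after the 2nd element
vec_longer
vec_inserted

# Negative indices drop elements instead of selecting them.
vec_without_first <- vec[-1]
vec_without_first
```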
A character vector is simply a collection of strings.
vec.char = c("Gene1", "Gene2", "Gene3")
vec.char
## [1] "Gene1" "Gene2" "Gene3"
Certain characters have a special meaning in R. To tell R not to use them as special characters, you have to “escape” them using a backslash, \. A backslash is also used to write characters that cannot be typed directly. These escape sequences include:
\\ to use a backslash in a string
\" to use double quotes within a string
\n new line character
\t tab character
\b backspace character
# just printing a string with escaped characters will show the backslashes
print(" \"hello world!\" ")
# this will show the raw contents of a string
writeLines(" \"Hello world!\" \n \"Well hello!\" ")
# this won't work
print(" "hello world!" ")
## Error: <text>:8:10: unexpected symbol
## 7: # this won't work
## 8: print(" "hello
## ^
# you can do this instead (single quotes around the whole string)
' "hello world" '
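To see that the backslashes are only a way of typing the string and are not stored in it, here is a small sketch using nchar(), which counts the characters in a string. The variable name quoted is made up.

```r
quoted <- " \"hello world!\" "

# nchar() counts the characters actually stored: the backslashes are not among them.
n_quoted <- nchar(quoted)
n_quoted

# A single-quoted string with inner double quotes is exactly the same string.
same_string <- identical(quoted, ' "hello world!" ')
same_string

# A tab or a new line counts as a single character.
n_tab     <- nchar("a\tb")
n_newline <- nchar("line1\nline2")

# cat() prints the raw contents, much like writeLines().
cat("Gene1\tGene2\n")
```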
Vectors containing the special values TRUE and FALSE can be generated directly or using logical operators that compare values. These can be really useful for finding values that match certain criteria.
Note that the value NA (“not available”, similar to NaN in Python) is also special: a comparison involving NA returns NA rather than TRUE or FALSE, so we use the function is.na() to find these values instead.
vec.num
## [1] 2 5 3 10 9 NA NA NA NA 42
vec.num > 3
## [1] FALSE TRUE FALSE TRUE TRUE NA NA NA NA TRUE
# which values are NA?
is.na(vec.num)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
# remove NA's
!is.na(vec.num)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
vec.filter = vec.num[!is.na(vec.num)]
vec.filter
## [1] 2 5 3 10 9 42
# filter by value
which(vec.num > 3) # index positions with values > 3
## [1] 2 4 5 10
vec.num[vec.num > 3] # subset of elements > 3 (but still has NA's!)
## [1] 5 10 9 NA NA NA NA 42
vec.filter[vec.filter > 3] # ok this looks good!
## [1] 5 10 9 42
# do it all in one go!
vec.num[ !is.na(vec.num) & vec.num > 3 ]
## [1] 5 10 9 42
Note that some operators have precedence over others. When in doubt, use parentheses!
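A few cases where precedence commonly trips people up are sketched below; the variable names are made up for illustration.

```r
n <- 5

# ':' is evaluated before '-', so this is (1:n) - 1
seq_shifted <- 1:n - 1
seq_shifted

# parentheses give the sequence from 1 to n - 1
seq_short <- 1:(n - 1)
seq_short

# '^' is evaluated before the unary minus
neg_square <- -2^2      # read as -(2^2)
pos_square <- (-2)^2

# '>' and '!' are both evaluated before '&'
x <- c(1, NA, 7)
keep <- !is.na(x) & x > 3   # read as (!is.na(x)) & (x > 3)
keep
```

Adding the parentheses explicitly, as in the comments, costs nothing and makes the intent clear.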
Factors are a way of defining groupings in your data, such as which items belong to control and treatment groups. Factors are very helpful when you want to perform an operation on values based on their group membership.
Below we compute the mean values for a set of measurements using the function tapply, which applies a function to the elements of an object according to their factor groupings and returns a table with the results. We will go over the apply family of functions later.
# some measurements (3 controls, 3 treatment replicates)
expvalues = c(10,12,9, 52,47,60)
# the experimental groups
expgroup = factor(c(rep("control",3),rep("treatment",3)))
expvalues
## [1] 10 12 9 52 47 60
expgroup
## [1] control control control treatment treatment treatment
## Levels: control treatment
levels(expgroup) # the unique group "levels"
## [1] "control" "treatment"
table(expgroup) # summarize the number of items in each group
## expgroup
## control treatment
## 3 3
tapply(expvalues, expgroup, mean) # the mean values in each group
## control treatment
## 10.33333 53.00000
Factor levels are ordered alphabetically by default. You can reorder, rename, and add labels (such as numerical codes) to factors as desired. We will get into this more later.
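As a brief preview, here is a sketch of how the order and labels of levels can be set with the levels and labels arguments of factor(). The dose data are invented for illustration.

```r
dose <- c("low", "high", "medium", "low", "high")

# Default: levels are sorted alphabetically.
dose_default <- factor(dose)
levels(dose_default)

# Explicit order via the levels argument.
dose_ordered <- factor(dose, levels = c("low", "medium", "high"))
levels(dose_ordered)

# Rename the levels via labels (matched to levels by position).
dose_coded <- factor(dose,
                     levels = c("low", "medium", "high"),
                     labels = c("L", "M", "H"))
as.character(dose_coded)

# Internally, each value is stored as the integer position of its level.
as.integer(dose_ordered)
```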
A matrix is a two-dimensional array and can only contain a single data type (e.g. either numbers or strings, but not both).
A matrix can be created in multiple ways:
matrix(): create a new matrix from a vector using nrow and ncol arguments
array(): create an array with dimensions given by the dim argument
dim(): convert a vector to a matrix by specifying its dimensions directly
rbind() and cbind(): combine several vectors or matrices together into rows or columns
vector_for_matrix = c(4,12,1,5,21,7,10,7,2,19,24,3) # a vector
# 1) matrix function
mat = matrix(data=vector_for_matrix,
nrow=3,
ncol=4) # every argument is named
mat = matrix(vector_for_matrix,
3,
4) # the same, but with positional arguments
mat
## [,1] [,2] [,3] [,4]
## [1,] 4 5 10 19
## [2,] 12 21 7 24
## [3,] 1 7 2 3
class(mat)
## [1] "matrix" "array"
# 2) array function
mat2 = array(data=vector_for_matrix,
dim=c(3,4))
mat2
## [,1] [,2] [,3] [,4]
## [1,] 4 5 10 19
## [2,] 12 21 7 24
## [3,] 1 7 2 3
# 3) dim function on a vector and a matrix
dim(vector_for_matrix) # check vector dimensions, dim() returns NULL for 1D objects
## NULL
matrix_from_vector = vector_for_matrix # make a copy of the vector
dim(matrix_from_vector) = c(3,4) # force the vector into 3x4 dimensions, which turns it into a matrix
matrix_from_vector
## [,1] [,2] [,3] [,4]
## [1,] 4 5 10 19
## [2,] 12 21 7 24
## [3,] 1 7 2 3
class(matrix_from_vector)
## [1] "matrix" "array"
# 4) bind vectors or matrices
v1 = c(letters[1:3])
v2 = c("d","e","f")
v3 = c(LETTERS[7:9])
v1
## [1] "a" "b" "c"
v2
## [1] "d" "e" "f"
v3
## [1] "G" "H" "I"
rmat = rbind(v1,v2,v3)
rmat
## [,1] [,2] [,3]
## v1 "a" "b" "c"
## v2 "d" "e" "f"
## v3 "G" "H" "I"
rownames(rmat)
## [1] "v1" "v2" "v3"
colnames(rmat)
## NULL
cmat = cbind(v1,v2,v3)
cmat
## v1 v2 v3
## [1,] "a" "d" "G"
## [2,] "b" "e" "H"
## [3,] "c" "f" "I"
rownames(cmat)
## NULL
colnames(cmat)
## [1] "v1" "v2" "v3"
# stick two matrices together - what happened to the names?
matrix_stitched = rbind(rmat,cmat)
matrix_stitched
## v1 v2 v3
## v1 "a" "b" "c"
## v2 "d" "e" "f"
## v3 "G" "H" "I"
## "a" "d" "G"
## "b" "e" "H"
## "c" "f" "I"
# assign new row names
rownames(matrix_stitched) = c("r1","r2","r3","r4","r5","r6")
matrix_stitched
## v1 v2 v3
## r1 "a" "b" "c"
## r2 "d" "e" "f"
## r3 "G" "H" "I"
## r4 "a" "d" "G"
## r5 "b" "e" "H"
## r6 "c" "f" "I"
Note that if you try to bind two vectors of unequal lengths, the values in the shorter vector will be recycled.
#create vectors of values
sepal.length <- c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8) # 6 values
sepal.width <- c(3.5, 3.0, 3.2, 3.2, 3.3) # only 5 values
# combine vectors as columns
mat <- cbind(sepal.width,sepal.length) # a warning is displayed but the code runs!
## Warning in cbind(sepal.width, sepal.length): number of rows of result is not a
## multiple of vector length (arg 1)
mat # note that no NAs are introduced, be careful!
## sepal.width sepal.length
## [1,] 3.5 5.1
## [2,] 3.0 4.9
## [3,] 3.2 7.0
## [4,] 3.2 6.4
## [5,] 3.3 6.3
## [6,] 3.5 5.8
# row and column names of matrix
colnames(mat)
## [1] "sepal.width" "sepal.length"
rownames(mat)
## NULL
# set row names
rownames(mat) <- c("r1","r2","r3","r4","r5","r6")
mat
## sepal.width sepal.length
## r1 3.5 5.1
## r2 3.0 4.9
## r3 3.2 7.0
## r4 3.2 6.4
## r5 3.3 6.3
## r6 3.5 5.8
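Since recycling fills the gap silently with a repeated value, one possible safeguard (a sketch, not the only approach) is to compare the lengths first and pad the shorter vector with NA. The underscore names below are made up so that they do not overwrite the vectors used above.

```r
sepal_length <- c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8)   # 6 values
sepal_width  <- c(3.5, 3.0, 3.2, 3.2, 3.3)        # only 5 values

# Check the lengths before binding.
same_length <- length(sepal_length) == length(sepal_width)
same_length

# Assigning a longer length pads the vector with NA.
length(sepal_width) <- length(sepal_length)
sepal_width

# Now cbind() gives no warning, and the missing value is visible.
padded <- cbind(sepal_width, sepal_length)
padded
```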
Elements of vectors and matrices can be accessed in a variety of ways.
vector_for_matrix[1:3] # sequential indices in a vector
## [1] 4 12 1
mat[1,] # a row and all columns
## sepal.width sepal.length
## 3.5 5.1
mat[,1] # a column and all rows
## r1 r2 r3 r4 r5 r6
## 3.5 3.0 3.2 3.2 3.3 3.5
mat[c(1,3,4), ] # specific rows, all columns
## sepal.width sepal.length
## r1 3.5 5.1
## r3 3.2 7.0
## r4 3.2 6.4
mat[ , 1:2] # all rows, first two cols
## sepal.width sepal.length
## r1 3.5 5.1
## r2 3.0 4.9
## r3 3.2 7.0
## r4 3.2 6.4
## r5 3.3 6.3
## r6 3.5 5.8
# inspect subsets of the data
mat[9] # just one element - what's going on here?
## [1] 7
# using this expression, the matrix is interpreted as a single (column) vector
# so element 9 is really the 3rd row in the 2nd column
# avoid this type of indexing with matrices!!!
mat[3,2] # ah, much better!
## [1] 7
# access / assign values using row and column names
mat["r2","sepal.length"] # access the value stored in the cell
## [1] 4.9
mat["r2","sepal.length"] = 5 # assign a new value
mat
## sepal.width sepal.length
## r1 3.5 5.1
## r2 3.0 5.0
## r3 3.2 7.0
## r4 3.2 6.4
## r5 3.3 6.3
## r6 3.5 5.8
# get a logical matrix
mat < 5
## sepal.width sepal.length
## r1 TRUE FALSE
## r2 TRUE FALSE
## r3 TRUE FALSE
## r4 TRUE FALSE
## r5 TRUE FALSE
## r6 TRUE FALSE
# filter using a logical expression
mat[ mat < 5 ]
## [1] 3.5 3.0 3.2 3.2 3.3 3.5
# a logical vector - beware, can produce confusing results!
mat[c(TRUE,TRUE,FALSE,FALSE),]
## sepal.width sepal.length
## r1 3.5 5.1
## r2 3.0 5.0
## r5 3.3 6.3
## r6 3.5 5.8
# can you see what happened here? the logical vector got recycled!
# so the first two rows and the last two rows were retained
# (1,2,5,6 interpreted as TRUE; 3,4 interpreted as FALSE)
# vector of named elements
fruit <- c(5, 10, 1, 20)
fruit
## [1] 5 10 1 20
names(fruit) <- c("orange", "banana", "apple", "peach")
fruit
## orange banana apple peach
## 5 10 1 20
lunch <- fruit[c("apple","orange")]
lunch
## apple orange
## 1 5
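Two related points are sketched below with a small made-up matrix m. Selecting a single row returns a plain vector unless drop = FALSE is given, and a logical index built from the matrix itself always has the right length, which avoids the recycling surprise shown above.

```r
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b")))

# Selecting one row drops the matrix structure by default.
row_as_vector <- m[1, ]
is.matrix(row_as_vector)

# drop = FALSE keeps the result as a 1 x 2 matrix.
row_as_matrix <- m[1, , drop = FALSE]
dim(row_as_matrix)

# A logical index with one value per row: no recycling involved.
keep <- m[, "a"] > 1
selected <- m[keep, ]
selected

# Negative indices remove rows.
without_r2 <- m[-2, ]
without_r2
```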
A drawback to matrices is that all the values have to be the same mode (either all numeric or all character). If you try to combine different types, the result will default to the character class, because numbers can be treated as characters, but not vice versa.
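A small sketch of this coercion, using invented gene names and values:

```r
# Binding a character vector and a numeric vector gives a character matrix.
mixed <- cbind(gene = c("Gene1", "Gene2"), expression = c(10.5, 3))
mixed
class(mixed[, "expression"])

# The numbers are now strings and must be converted back before doing arithmetic.
doubled <- as.numeric(mixed[, "expression"]) * 2
doubled

# The same rule applies to plain vectors.
logical_and_number <- c(TRUE, 2)        # the logical value becomes a number
everything_mixed   <- c(TRUE, 2, "a")   # everything becomes a string
logical_and_number
everything_mixed
```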
A data.frame is composed of vectors of the same length arranged as columns, where each column can be of a different type. This makes the data frame a perfect structure for mixed-type biomedical data!
Ideally, each row of a data frame should contain a single “observation”, i.e. data for a single item (e.g. a test subject, a gene), and each column should contain a different type of information or measurement (e.g. height, weight, etc.).
Other useful properties of data frames:
Rows and columns can be named using the rownames() and colnames() functions, or during data import.
Strings can be converted to factors when a data frame is created or imported: the stringsAsFactors parameter used to be set to TRUE by default, but as of R 4.0.0 this is no longer the case!
Data frame elements can be accessed using the traditional [] notation, but named columns can also be accessed using the more convenient $ notation.
# data vectors
sepal.length = c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8)
sepal.width = c(3.5, 3.0, 3.2, 3.2, 3.3, 2.7)
# group vector
species = c("setosa", "setosa",
"versicolor", "versicolor",
"virginica", "virginica")
# we can also use rep()
species = c(rep("setosa",2),
rep("versicolor",2),
rep("virginica",2))
# build a dataframe from three vectors - set strings as factors
iris.df = data.frame(sepal.length,sepal.width,species, # you can list ANY number of vectors here
stringsAsFactors=TRUE)
iris.df
## sepal.length sepal.width species
## 1 5.1 3.5 setosa
## 2 4.9 3.0 setosa
## 3 7.0 3.2 versicolor
## 4 6.4 3.2 versicolor
## 5 6.3 3.3 virginica
## 6 5.8 2.7 virginica
# assign row names
rownames(iris.df) <- c("p1","p2","p3","p4","p5","p6")
iris.df
## sepal.length sepal.width species
## p1 5.1 3.5 setosa
## p2 4.9 3.0 setosa
## p3 7.0 3.2 versicolor
## p4 6.4 3.2 versicolor
## p5 6.3 3.3 virginica
## p6 5.8 2.7 virginica
# view the species column
iris.df[,3] # basic matrix notation
## [1] setosa setosa versicolor versicolor virginica virginica
## Levels: setosa versicolor virginica
iris.df$species # access species column by name (doesn't work for matrices)
## [1] setosa setosa versicolor versicolor virginica virginica
## Levels: setosa versicolor virginica
# what is the class of this column?
class(iris.df$species)
## [1] "factor"
# make a table to view the number of items in each group
table(iris.df$species)
##
## setosa versicolor virginica
## 2 2 2
# a row from a dataframe is still a dataframe because it contains mixed data types
class(iris.df[2,])
## [1] "data.frame"
# get a summary of the data distribution
summary(iris.df)
## sepal.length sepal.width species
## Min. :4.900 Min. :2.700 setosa :2
## 1st Qu.:5.275 1st Qu.:3.050 versicolor:2
## Median :6.050 Median :3.200 virginica :2
## Mean :5.917 Mean :3.150
## 3rd Qu.:6.375 3rd Qu.:3.275
## Max. :7.000 Max. :3.500
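The same notation can be used to add columns and to pick out rows that match a condition. The sketch below rebuilds the small data set under a made-up name, flowers, so that it runs on its own.

```r
flowers <- data.frame(
  sepal.length = c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8),
  sepal.width  = c(3.5, 3.0, 3.2, 3.2, 3.3, 2.7),
  species      = rep(c("setosa", "versicolor", "virginica"), each = 2),
  stringsAsFactors = TRUE
)

# $ can also be used to create a new column.
flowers$ratio <- flowers$sepal.length / flowers$sepal.width

# Select rows with a logical condition and columns by name.
long_flowers <- flowers[flowers$sepal.length > 6, c("sepal.length", "species")]
long_flowers

# Single brackets keep the data frame; double brackets (or $) return the column itself.
class(flowers["species"])
class(flowers[["species"]])
```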
A list is a collection of objects. It can contain vectors, matrices, and data.frames of different lengths. It’s a great way to collate a bunch of different information.
Data frames are actually lists with a few restrictions, the most important being that all elements (the columns) must have the same length.
List elements are indexed using double square brackets [[]] or (if named) using the same $ as for data frames.
Beware: there is no restriction on giving different items of a list the same name, but then only the first one will be accessed using the $ notation.
a_list = list(sepal.width, sepal.length,
c("setosa", "versicolor","virginica"))
a_list
## [[1]]
## [1] 3.5 3.0 3.2 3.2 3.3 2.7
##
## [[2]]
## [1] 5.1 4.9 7.0 6.4 6.3 5.8
##
## [[3]]
## [1] "setosa" "versicolor" "virginica"
a_list = list(width=sepal.width,
length=sepal.length,
species=c("setosa", "versicolor","virginica"),
numberOfFlowers=50,
length="test" # yikes! watch out!
)
a_list$width
## [1] 3.5 3.0 3.2 3.2 3.3 2.7
mean(a_list$length)
## [1] 5.916667
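The difference between single and double square brackets is easy to miss, so here is a short sketch with a made-up list called flower_info. Single brackets return a smaller list, while double brackets return the element itself.

```r
flower_info <- list(width = c(3.5, 3.0),
                    species = c("setosa", "versicolor", "virginica"),
                    numberOfFlowers = 50)

# Single brackets: a list of length 1.
sub_list <- flower_info["species"]
class(sub_list)

# Double brackets: the element itself (here a character vector).
element <- flower_info[["species"]]
class(element)

# Indexing can be chained: third value of the second element.
third_species <- flower_info[[2]][3]
third_species

# New elements can be added by name.
flower_info$collector <- "unknown"
names(flower_info)

# lengths() reports the length of every element.
lengths(flower_info)
```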
There are a variety of useful commands for getting information about objects.
obj <- iris.df
class(obj)
## [1] "data.frame"
is(obj) # show class and inheritance
## [1] "data.frame" "list" "oldClass" "vector"
is(obj, "data.frame") # test the class type
## [1] TRUE
str(obj)
## 'data.frame': 6 obs. of 3 variables:
## $ sepal.length: num 5.1 4.9 7 6.4 6.3 5.8
## $ sepal.width : num 3.5 3 3.2 3.2 3.3 2.7
## $ species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 2 2 3 3
head(obj)
## sepal.length sepal.width species
## p1 5.1 3.5 setosa
## p2 4.9 3.0 setosa
## p3 7.0 3.2 versicolor
## p4 6.4 3.2 versicolor
## p5 6.3 3.3 virginica
## p6 5.8 2.7 virginica
summary(obj)
## sepal.length sepal.width species
## Min. :4.900 Min. :2.700 setosa :2
## 1st Qu.:5.275 1st Qu.:3.050 versicolor:2
## Median :6.050 Median :3.200 virginica :2
## Mean :5.917 Mean :3.150
## 3rd Qu.:6.375 3rd Qu.:3.275
## Max. :7.000 Max. :3.500
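You can write any single R object to a file using the function saveRDS(). The convention is to use the extension “.RDS” in the file name, although this is not imperative. To read a file that has been saved in this way, use readRDS().
The main advantage of using this pair of functions to write/read data is that the object is restored exactly as it was saved, including its class and attributes such as factor levels and row names (unlike with read.table(), which you will see below).
The main disadvantage is that the file is not human-readable: it is stored in a binary format meant to be read back into R, so it cannot be inspected in a text editor.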
# save iris.df
saveRDS(iris.df,
file="iris.df.RDS")
# read iris.df from the saved file
iris.df.from.file = readRDS("iris.df.RDS")
iris.df.from.file
## sepal.length sepal.width species
## p1 5.1 3.5 setosa
## p2 4.9 3.0 setosa
## p3 7.0 3.2 versicolor
## p4 6.4 3.2 versicolor
## p5 6.3 3.3 virginica
## p6 5.8 2.7 virginica
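To check that nothing is lost on the way, the round trip can be tested with identical(). This sketch uses a made-up data frame and a temporary file from tempfile(), so it does not leave anything behind in your working directory.

```r
measurements <- data.frame(
  value = c(10.5, 12, 52.25),
  group = factor(c("control", "control", "treatment")),
  row.names = c("s1", "s2", "s3")
)

# Write to a temporary file and read it back.
rds_path <- tempfile(fileext = ".RDS")
saveRDS(measurements, file = rds_path)
restored <- readRDS(rds_path)

# The restored object is an exact copy: factor column and row names included.
identical(measurements, restored)
class(restored$group)
rownames(restored)

# Clean up the temporary file.
unlink(rds_path)
```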
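To write and read matrices and data.frames, a commonly used pair of functions is write.table() and read.table().
If you are working with comma-separated files, you may use the special flavors write.csv() and read.csv(). The usage of ...table() and ...csv() is overall very similar, but some parameters are different - please refer to the help pages of these functions.
The main advantage of using this pair of functions to write/read data is that the file is plain text, so it can be read by people and opened in other programs.
The main disadvantages are that information about the object is not stored in the file (for example, factor columns are read back as character strings unless you ask otherwise), and that the parameters used for reading must match the way the file was written (e.g. header=TRUE).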
# save iris.df using write.table()
write.table(iris.df,
file="iris.df.txt",
row.names=FALSE) # spell out FALSE: F is an ordinary variable that can be overwritten
# read iris.df from the saved file
iris.df.from.txt = read.table("iris.df.txt",
header=TRUE)
iris.df.from.txt
## sepal.length sepal.width species
## 1 5.1 3.5 setosa
## 2 4.9 3.0 setosa
## 3 7.0 3.2 versicolor
## 4 6.4 3.2 versicolor
## 5 6.3 3.3 virginica
## 6 5.8 2.7 virginica
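Note that the row names (p1 to p6) are gone, because they were not written to the file. As a sketch of one way to keep them, the example below uses write.csv() and read.csv() with a made-up data frame and a temporary file. By default write.csv() stores the row names as the first column, and row.names = 1 tells read.csv() to use that column as row names again.

```r
measurements <- data.frame(
  value = c(10.5, 12, 52.25),
  group = factor(c("control", "control", "treatment")),
  row.names = c("s1", "s2", "s3")
)

csv_path <- tempfile(fileext = ".csv")
write.csv(measurements, file = csv_path)   # row names become the first column

# Restore row names and factors by asking for them explicitly.
from_csv <- read.csv(csv_path, row.names = 1, stringsAsFactors = TRUE)
rownames(from_csv)
class(from_csv$group)

# Without those arguments, the row names come back as an ordinary column named "X".
plain <- read.csv(csv_path)
names(plain)

# Clean up the temporary file.
unlink(csv_path)
```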