Pretty much everything in R is either an object or a function (or a package that contains these). Objects hold data, and functions generally perform some kind of operation, or transformation, on data.
R contains a lot of built-in functions that usually have the form function.name(). Some examples are listed here: Built-in Functions.
# some built-in functions
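As a minimal sketch of what calling built-in functions looks like, the lines below use a handful of base R functions. The inputs are made up for illustration and are not part of the original exercise.

```r
# a few commonly used built-in functions (illustrative inputs)
root <- sqrt(16)                  # square root
rounded <- round(3.14159, 2)      # round to 2 decimal places
shout <- toupper("hello")         # convert a string to upper case
steps <- seq(1, 10, by = 3)       # a regular sequence of numbers
total <- sum(steps)               # add up all the values

# take a look at the results
root
rounded
shout
steps
total
```

Typing ?sqrt (or any other function name after the question mark) opens the help page for that function.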
Some simple arithmetic and logical functions are so common that they have special symbols and are referred to as “operators”: R Operators
# arithmetic
# logical
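The two empty headings above could be filled in along the following lines. This is only a sketch with arbitrary numbers, showing the most common arithmetic and logical operators in base R.

```r
# arithmetic
7 + 2     # addition
7 - 2     # subtraction
7 * 2     # multiplication
7 / 2     # division
7 ^ 2     # exponentiation
7 %% 2    # modulo (remainder after division)
7 %/% 2   # integer division

# logical
7 > 2               # greater than
7 == 2              # equal to (note the double equals sign)
7 != 2              # not equal to
(7 > 2) & (7 < 5)   # AND: both sides must be TRUE
(7 > 2) | (7 < 5)   # OR: at least one side must be TRUE
!(7 > 2)            # NOT: flips TRUE and FALSE
```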
R can store several different types of data, such as numbers, character strings, and logical values. Numbers can also be written in scientific notation (e.g. 1e2 or 1e+2 equal 100).
These data types can be stored in different kinds of containers, or “objects”. Different objects are structured in different ways and have different limitations on how they can be used.
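To make the data types a little more concrete, here is a small sketch that stores one value of each common type and asks R what it is. The variable names are hypothetical and chosen only for this example.

```r
n <- 1e2             # a number written in scientific notation
i <- 100L            # an integer (the L suffix asks for integer storage)
s <- "one hundred"   # a character string
l <- n == i          # a logical value produced by a comparison

class(n)   # "numeric"
class(i)   # "integer"
class(s)   # "character"
class(l)   # "logical"
l          # TRUE, because 1e2 and 100L are the same number
```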
R also has a small number of built-in constants:
pi
## [1] 3.141593
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
month.abb
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
Most R functions expect a particular type of object, and will sometimes coerce similar objects (e.g. matrix, data frame) into the expected type if possible.
There are 5 basic classes of objects that we will come across most often: vectors, factors, matrices, data frames, and lists.
Objects are created using an arrow (<- or ->) or an equals sign (=).
Left-arrows are technically preferred, but many programmers are so used to using = that this symbol is also a valid assignment operator.
The right-arrow works, but is discouraged because it creates confusion and has no counterpart with the = assignment operator. Always assign your variables with the object name on the left!
# a character object
a <- "Hello"
a
## [1] "Hello"
class(a)
## [1] "character"
# right-arrow assignment works, but don't do it!
6*7 -> the.answer.to.the.universe.and.everything
the.answer.to.the.universe.and.everything
## [1] 42
class(the.answer.to.the.universe.and.everything)
## [1] "numeric"
# assignment using the equals sign
b = "world"
b
## [1] "world"
# oops! don't try this, ever.
2 = c
## Error in 2 = c: invalid (do_set) left-hand side to assignment
c
## function (...) .Primitive("c")
Here are a few tips for naming objects and functions (when you get around to making your own):
Use meaningful names to help you remember what they are for (i.e. avoid names like a, b, etc.). Valid characters include letters, numbers, ., and _, but a name cannot start with a number or an underscore. Pick a convention and stick to it.
R is case sensitive! My_Favorite_Things is not the same as my_favorite_things.
CamelCase seems to not be commonly used in R packages, functions, or object names. This is probably because using a separator provides a little more visual clarity.
Some words are reserved and cannot be used as object names (e.g. TRUE, FALSE, NA, function, etc.).
You should always avoid giving an object the same name as a function to avoid confusion.
# never do this!
mean = mean(2,3)
mean
## [1] 2
# a data frame
df.my_cool_data = data.frame(101:103)
df.my_cool_data
## X101.103
## 1 101
## 2 102
## 3 103
class(df.my_cool_data)
## [1] "data.frame"
A vector is the most basic type of data structure in R. Actually, pretty much everything in R is made up of vectors! We will see this as we go through each type of object. Vectors have the following properties:
This figure from R Programming for Research illustrates the different types of vectors:
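Vectors are created using the c() function.
Elements are accessed by their index position using square brackets, e.g. [2], [1:4].
The number of elements in a vector is given by the length() function.
# a scalar is just a vector of length 1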
# Euler's number
# take a look at the result
# vector notation (see below)
vec.num = # combine five numbers into a vector
# take a look at it
# what *is* this data structure?
# subsetting: square brackets
# a single element
# a slice - watch out for non-existent elements!
# get values at index positions 1-3 and 5
# how long is it?
# reassignment
# adding elements
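One possible way to work through the steps listed in the comments above is sketched here. The five numbers are invented for illustration, so any other values would do just as well.

```r
# a scalar is just a vector of length 1
e <- exp(1)          # Euler's number
e                    # the [1] in the printed output is the index of the first element

# combine five numbers into a vector (values assumed for this example)
vec.num <- c(1, 3, 5, 7, 9)
vec.num
class(vec.num)       # "numeric"

# subsetting: square brackets
vec.num[2]           # a single element: 3
vec.num[4:6]         # a slice: 7 9 NA (position 6 does not exist)
vec.num[c(1:3, 5)]   # index positions 1-3 and 5: 1 3 5 9

# how long is it?
length(vec.num)      # 5

# reassignment
vec.num[2] <- 10

# adding elements
vec.num <- c(vec.num, 11)
vec.num
```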
A character vector is simply a collection of strings.
vec.char =
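As a small sketch, a character vector can be built with c() in the same way as a numeric one. The strings below are placeholders, not values from the original exercise.

```r
# combine three strings into a character vector (example values)
vec.char <- c("control", "treatment", "placebo")
vec.char
class(vec.char)        # "character"
length(vec.char)       # 3 elements
nchar(vec.char)        # number of characters in each string: 7 9 7
toupper(vec.char[1])   # functions can be applied to single elements too
```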
Certain characters have a special meaning in R. To tell R not to treat them as special characters, you have to “escape” them using a backslash, \. Backslash sequences are also used to write characters such as a new line. These include:
\\: to use a backslash in a string
\": to use double quotes within a string
\n: new line character
\t: tab character
\b: backspace character
# just printing a string with escaped characters will show the backslashes
print(" \"hello world!\" ")
# this will show the raw contents of a string
writeLines(" \"Hello world!\" \n \"Well hello!\" ")
# this won't work
print(" "hello world!" ")
## Error: <text>:8:10: unexpected symbol
## 7: # this won't work
## 8: print(" "hello
## ^
# you can do this instead
' "hello world" '
Vectors containing the special values TRUE and FALSE can be generated directly or using logical operators that compare values. These can be really useful for finding values that match certain criteria.
Note that the value NA (not available, similar to NaN in Python) is also special: any comparison involving NA returns NA rather than TRUE or FALSE, so we use the function is.na() to find these values instead.
vec.num
# compare all values to a number
# which values are NA?
# remove NA's
# filter based on some criteria
# index positions with values > 3
# subset of elements > 3 (but still has NA's!)
# get it right
# do it all in one go!
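The steps in the comments above might be filled in as follows. This sketch defines its own vec.num with made-up values (including one NA) so that it can be run on its own.

```r
vec.num <- c(1, 4, NA, 7, 2)        # hypothetical values with one missing entry

# compare all values to a number
vec.num > 3                         # FALSE TRUE NA TRUE FALSE

# which values are NA?
is.na(vec.num)

# remove NA's
vec.clean <- vec.num[!is.na(vec.num)]

# index positions with values > 3
which(vec.num > 3)                  # 2 4

# subset of elements > 3 (but still has NA's!)
vec.num[vec.num > 3]                # 4 NA 7

# get it right: which() drops the NA
vec.num[which(vec.num > 3)]         # 4 7

# do it all in one go!
big <- vec.num[!is.na(vec.num) & vec.num > 3]
big
```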
Note that some operators have precedence over others. When in doubt, use parentheses!
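Two classic cases where precedence surprises people are sketched below; the variable n is just an example value.

```r
# exponentiation is evaluated before the unary minus
-2^2         # -4, i.e. -(2^2)
(-2)^2       # 4

# the sequence operator : is evaluated before + and -
n <- 5
1:n - 1      # 0 1 2 3 4, i.e. (1:n) - 1
1:(n - 1)    # 1 2 3 4
```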
Factors are a way of defining groupings in your data, such as which items belong to control and treatment groups. Factors are very helpful when you want to perform an operation on values based on their group membership.
Below we compute the mean values for a set of measurements using the function tapply, which applies a function to the values of an object within each of their corresponding factor groupings, and then creates a table with the result. We will go over the apply family of functions later.
# some measurements (3 controls, 3 treatment replicates)
expvalues =
# the experimental groups
expgroup =
expvalues
expgroup
# the unique group "levels"
# summarize the number of items in each group
# get the mean values in each group with tapply
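A minimal sketch of this factor example is given below. The six measurements are invented, and the group labels "control" and "treatment" are assumed from the description above.

```r
# some measurements (3 controls, 3 treatment replicates); values are made up
expvalues <- c(0.9, 1.1, 1.0, 2.1, 1.9, 2.0)

# the experimental groups, stored as a factor
expgroup <- factor(c(rep("control", 3), rep("treatment", 3)))

# the unique group "levels"
levels(expgroup)

# summarize the number of items in each group
table(expgroup)

# get the mean values in each group with tapply
group.means <- tapply(expvalues, expgroup, mean)
group.means
```

With these numbers the group means come out at about 1 for the controls and about 2 for the treated samples.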
Factor levels are ordered alphabetically by default. You can reorder, rename, and add labels (such as numerical codes) to factors as desired. We will get into this more later.
A matrix is a two-dimensional array and can only contain a single data type (e.g. either numbers or strings, but not both).
A matrix can be created in multiple ways:
matrix(): create a new matrix from a vector using nrow and ncol arguments
array(): create an array with dimensions given by the dim argument
dim(): convert a vector to a matrix by specifying its dimensions directly
rbind() and cbind(): combine several vectors or matrices together into rows or columns
x = c(4,12,1,5,21,7,10,7,2,19,24,3) # a vector
# 1) matrix function
mat =
mat
class(mat)
# 2) array function
mat2 =
mat2
# 3) dim function on a vector
mat3 =
mat3
# 4) bind vectors or matrices
v1 = c(letters[1:3])
v2 = c("d","e","f")
v3 = c(LETTERS[7:9])
rmat =
rmat
rownames(rmat)
colnames(rmat)
cmat =
cmat
rownames(cmat)
colnames(cmat)
# stick two matrices together - what happened to the names?
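The four approaches could be filled in roughly as follows. The choice of 3 rows and 4 columns is an assumption made for this sketch; any pair of dimensions whose product is 12 would work for this vector.

```r
x <- c(4, 12, 1, 5, 21, 7, 10, 7, 2, 19, 24, 3)   # a vector

# 1) matrix function (values are filled in column by column)
mat <- matrix(x, nrow = 3, ncol = 4)
mat
class(mat)

# 2) array function
mat2 <- array(x, dim = c(3, 4))

# 3) dim function on a vector
mat3 <- x
dim(mat3) <- c(3, 4)

# all three approaches give the same values
all(mat == mat2)
all(mat == mat3)

# 4) bind vectors or matrices
v1 <- letters[1:3]
v2 <- c("d", "e", "f")
v3 <- LETTERS[7:9]
rmat <- rbind(v1, v2, v3)   # vectors become rows, named after the vectors
rownames(rmat)
colnames(rmat)              # NULL: the columns have no names
cmat <- cbind(v1, v2, v3)   # vectors become columns, named after the vectors
rownames(cmat)              # NULL: the rows have no names
colnames(cmat)

# stick two matrices together and inspect the resulting names
both <- rbind(rmat, cmat)
dimnames(both)
```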
Note that if you try to bind two vectors of unequal lengths, the values in the shorter vector will be recycled. R issues a warning when the longer length is not a multiple of the shorter one.
# create vectors of values
sepal.length <- c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8) # 6 values
sepal.width <- c(3.5, 3.0, 3.2, 3.2, 3.3) # only 5 values
# combine vectors as columns
mat <-
mat
# inspect subsets of the data
# just one element
# a row and all columns
# a column and all rows
# row and column names of matrix
colnames(mat)
rownames(mat)
# set row names
# access / assign values using row and column names
mat["p4","sepal.length"] = 5
# get a logical matrix
# filter using a logical statement
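Here is one way the blanks in this exercise might be completed. The row labels p1 to p6 are assumed from the mat["p4","sepal.length"] line above; everything else follows the comments.

```r
sepal.length <- c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8)   # 6 values
sepal.width <- c(3.5, 3.0, 3.2, 3.2, 3.3)         # only 5 values

# combine vectors as columns; R warns and recycles 3.5 into row 6
mat <- cbind(sepal.length, sepal.width)
mat

# inspect subsets of the data
mat[2, 1]    # just one element
mat[2, ]     # a row and all columns
mat[, 2]     # a column and all rows

# set row names
rownames(mat) <- paste0("p", 1:6)

# access / assign values using row and column names
mat["p4", "sepal.length"] <- 5

# get a logical matrix
mat > 5

# filter rows using a logical statement on one column
long <- mat[mat[, "sepal.length"] > 5, ]
long
```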
Elements of vectors and matrices can be accessed in a variety of ways.
x[1:3] # sequential indices
## [1] 4 12 1
mat[c(1,3,4), ] # specific rows, all columns
mat[ , 1:2] # all rows, first two cols
mat[c(TRUE,TRUE,FALSE),] # a logical vector
x[x<5] # a logical expression
## [1] 4 1 2 3
# vector of named elements
fruit <- c(5, 10, 1, 20)
fruit
## [1] 5 10 1 20
names(fruit) <- c("orange", "banana", "apple", "peach")
fruit
## orange banana apple peach
## 5 10 1 20
lunch <- fruit[c("apple","orange")]
lunch
## apple orange
## 1 5
A drawback to matrices is that all the values have to be the same mode (either all numeric or all character). If you try to combine a mixture of types, the result will default to the character class because numbers can be treated as characters, but not vice versa.
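The coercion to character described above can be seen in a short sketch like this one, which uses made-up values.

```r
# mixing types in a vector turns everything into strings
mixed <- c(1, "a", TRUE)
mixed            # "1" "a" "TRUE"
class(mixed)     # "character"

# the same happens when binding a numeric and a character column
m <- cbind(id = 1:2, name = c("x", "y"))
m
is.character(m)  # TRUE: the ids are now stored as "1" and "2"

# strings that look like numbers can be converted back ...
back <- as.numeric(m[, "id"])
back

# ... but other strings cannot (the result is NA, with a warning)
bad <- suppressWarnings(as.numeric("a"))
bad
```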
A dataframe is composed of vectors of the same length arranged as columns, where each column can be of a different type. This makes the data frame a perfect structure for mixed-type biomedical data!
Ideally, each row of a data frame should contain a single “observation”, i.e. data for a single item (e.g. a test subject, a gene), and each column should contain a different type of information or measurement (e.g. height, weight, etc.).
Other useful properties of data frames:
Rows and columns can be named using the rownames() and colnames() functions, or during data import.
The stringsAsFactors parameter, which converts strings to factors, used to be set to TRUE by default, but as of R 4.0.0, this is no longer true!
Dataframe elements can be accessed using the traditional [] notation, but named columns can also be accessed using the more convenient $ notation.
# data vectors
sepal.length = c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8)
sepal.width = c(3.5, 3.0, 3.2, 3.2, 3.3, 2.7)
# group vector
species = c("setosa", "setosa",
"versicolor", "versicolor",
"virginica", "virginica")
# how can we do this using rep()?
# build a dataframe from three vectors - set strings as factors
iris.df =
# assign row names
rownames(iris.df) <- c("p1","p2","p3","p4","p5","p6")
iris.df
# view the species column
# basic matrix notation
# access species column by name (doesn't work for matrices)
# what is the class of this column?
# make a table to view the number of items in each group
# a row from a dataframe is still a dataframe because it contains mixed data types
class(iris.df[2,])
# get a summary of the data distribution
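A possible completion of this exercise is sketched below. It uses rep() with the each argument to build the group vector and sets stringsAsFactors = TRUE explicitly, as the comment in the exercise asks.

```r
# data vectors
sepal.length <- c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8)
sepal.width <- c(3.5, 3.0, 3.2, 3.2, 3.3, 2.7)

# group vector, built with rep(): each species name is repeated twice
species <- rep(c("setosa", "versicolor", "virginica"), each = 2)

# build a dataframe from three vectors - set strings as factors
iris.df <- data.frame(sepal.length, sepal.width, species,
                      stringsAsFactors = TRUE)

# assign row names
rownames(iris.df) <- paste0("p", 1:6)
iris.df

# view the species column
iris.df[, 3]               # basic matrix notation
iris.df$species            # access by name (doesn't work for matrices)
class(iris.df$species)     # "factor"

# make a table to view the number of items in each group
table(iris.df$species)

# a row from a dataframe is still a dataframe
class(iris.df[2, ])

# get a summary of the data distribution
summary(iris.df)
```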
A list is a collection of objects. It can contain vectors, matrices, and dataframes of different lengths. It’s a great way to collate a bunch of different information.
Data frames are actually lists with a few restrictions, most importantly that every element (column) must have the same length.
List elements are indexed using double square brackets [[]] or (if named) using the same $ as for data frames.
Beware: there is no restriction on giving different items of a list the same name, but then only the first one will be accessed using the $ notation.
a_list = list(sepal.width, sepal.length,
c("setosa", "versicolor","virginica"))
a_list
## [[1]]
## [1] 3.5 3.0 3.2 3.2 3.3 2.7
##
## [[2]]
## [1] 5.1 4.9 7.0 6.4 6.3 5.8
##
## [[3]]
## [1] "setosa" "versicolor" "virginica"
a_list = list(width=sepal.width,
length=sepal.length,
species=c("setosa", "versicolor","virginica"),
numberOfFlowers=50,
length="test" # yikes! watch out!
)
a_list$width
## [1] 3.5 3.0 3.2 3.2 3.3 2.7
mean(a_list$length)
## [1] 5.916667
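The difference between single and double brackets, and the duplicate-name pitfall, can be checked with a small sketch. It uses a shortened, made-up version of the list so that it runs on its own.

```r
# a short list with a duplicated name (example values)
a_list <- list(width = c(3.5, 3.0),
               species = "setosa",
               length = c(5.1, 4.9),
               length = "test")

a_list[[1]]           # double brackets return the element itself (a numeric vector)
a_list[1]             # single brackets return a list of length 1
a_list[["species"]]   # double brackets also work with names

a_list$length         # only the first "length" element: 5.1 4.9
a_list[[4]]           # the second one is still reachable by position: "test"
names(a_list)         # shows the duplicated name
```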
There are a variety of useful commands for getting information about objects.
obj = iris.df
class(obj)
is(obj) # show class and inheritance
is(obj, "data.frame") # test the class type
str(obj)
head(obj)
summary(obj)
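The same commands can be tried on any object. The sketch below applies them to a tiny data frame invented for this example, and adds dim() and names(), which are also commonly used for a first look at an object.

```r
# a small example data frame (hypothetical values)
obj <- data.frame(id = 1:3, group = c("a", "b", "a"))

class(obj)                       # "data.frame"
methods::is(obj, "data.frame")   # test the class type
dim(obj)                         # number of rows and columns
names(obj)                       # column names
str(obj)                         # compact overview of the structure
head(obj, 2)                     # the first two rows
summary(obj)                     # per-column summary
```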