Before we can work with data, we need to be able to read it in an represent it in R
Reading in data
Data structures
Important data structures in R are
- Atomic variables (numeric, char)
- Vectors (a sequence of atomic variables. For example c(1,2,3,4) is a vector of 4 numeric variables)
- Lists (like a vector, but unordered)
- A data frame (technically a list of vectors, untechnically think of it as a spreadsheat with collumns that can be of different atomic type)
Hint: you can find out the type and structure of an object in R with the class()
and str()
command
Tutorial and examples of how to use them are
-
Appendix 2 of the Essential statistics lecture notes
-
This tutoria
-
The QuickR reference
The following subsections only give an introduction
Some basic examples about data structures in R
Basic variable assignments
It is possible to assign a value to an object with the =
or ->
symbols
x = 55
x <- 55
and check the result with the print
command
print(x)
## [1] 55
#or
x
## [1] 55
R is case sensitive, if we try to type X
we got the following message: Error: object 'X' not found
A common mistake in R is to write an incomplete command. If we type sqrt(x
we get an error. We gotta complete the script (or press ESC if the programm gets stuck)
It is possible to remove an object from our workspace with the rm
command
rm(y)
## Warning in rm(y): object 'y' not found
Object names can include numbers
z1 = 15
z1
## [1] 15
But they cannot begin with numbers. 2z =15
, for example, would create an error
We use quotation marks to assign characters to objects
m1 = "Rclass"
m1
## [1] "Rclass"
m2="25" #like this, 25 is not a number anymore, but a character
m2
## [1] "25"
Vectors
We can create a vector by using the concatenated command c()
x1 = c(1,3,5,7,9)
x1
## [1] 1 3 5 7 9
gender = c("male", "female")
gender
## [1] "male" "female"
2:7 # sequence from 2 to 7
## [1] 2 3 4 5 6 7
seq(from=1, to=7, by=1) # sequence from 1 to 7 by 1
## [1] 1 2 3 4 5 6 7
seq(from=1, to=7, by=1/3) # sequence from 1 to 7 by 0,333
## [1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
## [8] 3.333333 3.666667 4.000000 4.333333 4.666667 5.000000 5.333333
## [15] 5.666667 6.000000 6.333333 6.666667 7.000000
seq(from=1, to=7, by=0.25) # sequence from 1 to 7 by 0,25
## [1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25
## [15] 4.50 4.75 5.00 5.25 5.50 5.75 6.00 6.25 6.50 6.75 7.00
Create a vector repeating something a certain number of times
rep(1, times=10)
## [1] 1 1 1 1 1 1 1 1 1 1
rep("vector", times=10)
## [1] "vector" "vector" "vector" "vector" "vector" "vector" "vector"
## [8] "vector" "vector" "vector"
rep(1:5, times=5)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(seq(from=2, to=5,by=0.25), times =5)
## [1] 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00 2.00
## [15] 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00 2.00 2.25
## [29] 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00 2.00 2.25 2.50
## [43] 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00 2.00 2.25 2.50 2.75
## [57] 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 5.00
rep(c("m","f"),times=10)
## [1] "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m"
## [18] "f" "m" "f"
we can add a value to each element of the vector
x = 1:5
x + 10
## [1] 11 12 13 14 15
x-10
## [1] -9 -8 -7 -6 -5
x*10
## [1] 10 20 30 40 50
We may add/subtract/mult/div but the vectors HAVE to be the same length
y = c(1,3,5,7,9)
x = 1:5
x
## [1] 1 2 3 4 5
y
## [1] 1 3 5 7 9
x+y
## [1] 2 5 8 11 14
x-y
## [1] 0 -1 -2 -3 -4
It is possible to extract elements of a vector by using squared brakets
y
## [1] 1 3 5 7 9
y[2]
## [1] 3
#A negative sign indicates R to extract all the elements except that one
y[-2]
## [1] 1 5 7 9
#extract the first and the third elements
y[c(1,3)]
## [1] 1 5
#extract all the elemets except the first and the third
y[-c(1,3)]
## [1] 3 7 9
#extract all the elements above the third one
y[y<3]
## [1] 1
Matrices
We can create a matrix of values by using the matrix command
matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=TRUE) #enter the elements rowwise
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=FALSE) #values entered columnwise
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
mat1= matrix(c(1,2,3,4,5,6,7,8,9),nrow=3,byrow=TRUE) #enter the elements
mat1
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
Square brakets are used to grab elements from the matrix
mat1[1,2] #element in the first row and second column
## [1] 2
mat1[c(1,3),2]
## [1] 2 8
mat1[2,] #row 2nd and all the columns
## [1] 4 5 6
mat1[,1]
## [1] 1 4 7
mat1*10
## [,1] [,2] [,3]
## [1,] 10 20 30
## [2,] 40 50 60
## [3,] 70 80 90
Data frames
The data frame is the most common choice to represent your field data, thefore it’s important to know how to work with them and select data from them.
http://www.uni-kiel.de/psychologie/rexrepos/rerData_Frames.html
Working with big data
http://blog.revolutionanalytics.com/2014/08/the-iris-data-set-for-big-data.html
Text File Splitter: can split very larve (> several GB) textfiles.