The aim of this tutorial is to introduce you to R and to data exploration. Note that the practicals in this short course are designed for you to copy the code and run all the estimations yourself in R, to familiarize yourself with the program.
The idea is to have code chunks, with real data - to ensure you don’t become this cat:
Let’s first get you set up.
Make sure you have the following installed on your computer:
Select the .exe download link from the table that corresponds to your version of R. Please note: if you're not sure what version of R you have, open or restart R and type in the console:
sessionInfo()[1]$R.version$version.string
Note that when running Rtools.exe it is important not to simply click through the installer options: be careful to check the box that lets the installer edit your PATH.
In your console, run the following code:
install.packages("devtools")
Last thing: check that the following lines of code produce TRUE in your console:
library(devtools)
find_rtools()
Once you have downloaded RStudio and opened it up, select: File / New File / R Script.
I suggest that you work with the notes that follow in RStudio, copying and pasting all the code into an R script, saving it, and making sure it works and that you understand the process behind the coding.
Also, remember, the following are your friends:
Stack Overflow - the Google of R.
But first, I suggest printing out and keeping this post with you as you work through this tutorial (ignore the base R plotting section…).
Periodically, you might have to install a newer version of R. Luckily this takes only a few minutes, and happens automatically if you do the following:
# installing/loading the package:
if(!require(installr)) { install.packages("installr"); require(installr)}
check.for.updates.R() # tells you if there is a new version of R or not.
install.R() # download and run the latest R Version. Awesome.
copy.packages.between.libraries()
# copy your packages to the newest R installation from the one version before it.
# (if ask=T, it will ask you between which two versions to perform the copying)
I will illustrate how R basically thinks, and how you should "think" in R, with an example. Note this is what we call base R coding. Other, more optimized packages like dplyr and the tidyverse variants that we will be using for our data analytics have different notations. You should, however, be able to understand base R, as you will often use it in your daily R lives as well.
Please don’t be discouraged by this session if you find it cumbersome or boring - the tidyverse way of doing things that we will get to later is infinitely superior.
The following are common types of variables in R:
TRUE # logical. Useful in many instances when evaluating logical phrases.
1L # integer
1 # double (numeric)
1+0i # complex
"one" # character
Some types have special meanings:
NaN # double "not a number" (try 0/0)
Inf # double "infinity" (try 10 / 0)
-Inf # double "negative infinity" (try -10 / 0)
NA # logical "missing" value
# You can also use explicit labelling of NA:
NA_integer_ # integer "missing" value
NA_real_ # double "missing" value
NA_complex_ # complex "missing" value
NA_character_ # character "missing" value
NULL # special variable without a type
Some of the common operators we will be using include:
+ # addition
- # subtraction
/ # division
* # multiplication
^ # power
! # negation e.g.: if ( x != 10) ... which means not equal
& # logical and e.g.: if (x > 5 & x < 10) {print("medium range")}
| # logical or e.g.: if (x > 10 | x < 5) {print("outside medium range")}
== # equals e.g.: if (x == 10 | x == 5) {print("on boundary")}
!= # not equals (as above)
> # greater than
>= # greater than or equal
< # less than
<= # less than or equal
%in% # Checking if something is in something else
# As an illustration of the above, run some examples:
x <- 10
if( is.numeric(x) ) print("x is a number!")

# How to use %in% and not (!) %in%
vec <- c("A", "B", "C")
x <- "D"
if( !x %in% vec ) print("x is not in the vec!")
# Notice for %in%, we don't use !%in%
Variable names must start with a letter or a dot, and may contain only letters, numbers, dots and underscores. You can use arrows or = to define a variable:
# Same thing:
x <- 10
x = 10
10 -> x

# Won't work (must start with a letter):
2x <- 10
Almost every object in R is vectorized - it can contain multiple elements and has an attribute called “length.”
length(NaN)
length(1)
length("Words are meaningless") # a string is not a vector, but an element - with length 1
# Purely for illustration - the length of words:
length( strsplit("Words are meaningless", " ")[[1]] )
c() can be used for combining multi-element vectors too:
vec1 <- c(1, 2)
vec2 <- c(2, 3)
c(vec1, vec2)
Functions work with vectors too, and provide predictable output:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 3, 1, 5)
mean(x)
sum(y)
log10(x)
cor(x, y)
paste(x, y)
# Vectors can also have names:
x <- c(a=1, b=2, c=3, d=4)
names(x)
A vector can be treated as an array of elements, where we might want to subset it by selecting only a part of those elements. In R this is performed using the [ operator:
x <- c(10, 20, 30, 40)
x[4]
x[c(1,3)] # choose element 1 and 3
x[6] # Gives NA as there's no 6th element
x[-3] # Removes 3rd element
Elements can be selected by specifying TRUE if the element should be returned, and FALSE otherwise:
x <- c(10, 20, 30, 40, NA, 50)
x[c(TRUE, FALSE, FALSE, FALSE)]
## [1] 10 NA
x[c(FALSE, TRUE)] # This is termed recycling, as it will repeat F and T - see output (selects 2, 4 and 6).
## [1] 20 40 50
# The above is useful for logical operations:
# Note the output of:
is.na(x)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
# Now we can use it to replace NA in x as (using the logical operators):
x[is.na(x)] <- 1e6 # 1 with 6 zeros, or a million
x
## [1] 1e+01 2e+01 3e+01 4e+01 1e+06 5e+01
# Advanced example: (read carefully)
# Another useful example is simple winsorising (rnorm creates random data):
set.seed(123) # Set the randomness seed so that you get the same random data as I do:
# Create 100 random data points with a N(0,1) distribution:
random_normal <- rnorm(n = 100, mean = 0, sd = 1)

# Let's now winsorise this vector by replacing values below the first quartile with the value of the first quartile:
Quartile_level <- quantile(random_normal, probs = 0.25)

random_normal[random_normal < Quartile_level] <- Quartile_level
random_normal
# Notice that exactly a quarter of the values are now equal to -0.4938:
sum( random_normal == Quartile_level )
## [1] 25
R uses columns and arrays to define data frames. These can be adjusted (as will be seen) to tell R whether your data is a time-series, panel, or whichever format is intended.
Type the following code to create a set in R: (Remember: R is case sensitive!)
R <- c("Very Happy", "Happy", "Not Happy")
# Let's now create responses:
W <- c(15,5,3)
M <- c(35,15,14)
C <- c(23,35,32)
# Now we have many variables assigned names, but we now want to concatenate it all...
# i.e. let's merge the columns together in a single data.frame (as a single unit)
# To change the column names, simply type the name first:
HappySurvey <- data.frame(Responses = R, Women = W, Men = M, Children = C)
# Now to isolate a column, say Men, and count the responses, use the $ sign:
sum(HappySurvey$Men)
# Note the following syntax, these two are the same:
x <- HappySurvey$Men
x <- HappySurvey[,3] # calling all rows of column 3

# Other useful base R commands include:
mean(x)
min(x)
median(x)
summary(x)
# To do, say, a Chi-square test:
chisq.test(HappySurvey$Men)
Congratulations, you're up and running in R!
May there be many more happy coding hours!
Libraries are essential to using R. They are collections of functions that are neatly wrapped into a documented framework.
Let’s take a standard already installed library: stats
To see what a library has to offer, type the following in RStudio, followed by CTRL + Spacebar (try it):
stats::
If you are using a package or base R functions, and you do not know what the inputs are: do the following:
?stats::acf
?acf

# to see a function's code:
stats:::acf
To execute a function - you have to end it with brackets. Most of the time, a function requires inputs. These are given in the brackets themselves (as below).
To know what to input, simply go ?stats::acf to see documentation.
To just scan the parameters needed in a function, type stats::acf(), put your cursor in the brackets, and press CTRL + Space. This makes it super easy to find functions in packages and the parameters they take…
Try it yourself - as explained in the gif, rnorm creates random normal variables.
stats::acf(x = rnorm(100))
Once you have identified a package online that you would want to use, you first have to download it to your computer (R knows where to save it).
A package could either be on a peer-reviewed and verified platform (e.g. CRAN), or on someone's GitHub page (unverified).
Let’s quickly install xts - a very nice time-series package we’ll be using quite a bit.
install.packages("xts")
# Once the package is installed on your computer, you can call it using the following command:
library(xts)
The package, xts, has now been installed on your computer using install.packages. You never have to do this again (only for updates). But note that each time you open R, you have to tell it which packages to load using library. Having typed the above, xts is now active for the whole session, until you close R.
You could add the following in code so as to install it only if it has not yet been installed:
# As you have already now installed dplyr, it shouldn't now install again.
# Please use this in your code:
if(!require(dplyr)) install.packages("dplyr") # This installs dplyr if you don't yet have it...
library(dplyr)
Pacman, or package manager, could also be used.
This package allows you to load and install packages easily and intuitively.
if(!require(pacman)) install.packages("pacman") # This installs pacman if you don't yet have it...
# Using p_load, pacman will try to load a package, or try install it if you have not already installed it:
pacman::p_load(SomePackage)
Other functionality includes:
Ray <- array(c(1:100), dim=c(2,5))
# Note what happened in your RStudio console when executing this command.
#Let's give the data row and column names as follows:
colnames(Ray) <-c("Men","boys","women","children","babies")
rownames(Ray) <- c("satisfaction","Communication")
# ... By the way, I have no idea what the above names imply...
We won’t be using this often, but just to note.
Matrices are vectors with an additional attribute of dimension.
We can use rbind (row bind) and cbind (column bind) to add together vectors or arrays.
A matrix in R is thus merely a vector with two additional attributes: number of rows (nrow(X)) and number of columns (ncol(X)). Consider, e.g., the following matrix construction:
xx <- seq(from = 1, to = 16, by = 3)
# Note what seq() does...
yy <- seq(from = 1, to = 100, by = 18)

# Now let's merge these into a single matrix:
mat <- matrix(cbind(xx,yy), nrow=6, ncol=2)
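To see rbind and cbind directly, here is a small sketch (the vectors are made up for illustration):

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

cbind(a, b) # binds the vectors as columns: a 3x2 matrix
rbind(a, b) # binds the vectors as rows: a 2x3 matrix

# The dimension attribute confirms this:
dim(cbind(a, b)) # 3 2
dim(rbind(a, b)) # 2 3
```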
Data frames allow us to view data more effectively and manage our analytics process more easily. Note that it can be slower in long estimation procedures, where arrays / matrices could be much quicker to use.
I strongly recommend getting used to using tibbles as default opposed to data.frames (see this link for motivation).
Data frames in R require each column to be of a single type (e.g. numeric, character), and the columns combined must have the same number of rows.
df <- data.frame(id=1:20, name = LETTERS[1:20], state = rep(c(TRUE,FALSE), 10))

df
## id name state
## 1 1 A TRUE
## 2 2 B FALSE
## 3 3 C TRUE
## 4 4 D FALSE
## 5 5 E TRUE
## 6 6 F FALSE
## 7 7 G TRUE
## 8 8 H FALSE
## 9 9 I TRUE
## 10 10 J FALSE
## 11 11 K TRUE
## 12 12 L FALSE
## 13 13 M TRUE
## 14 14 N FALSE
## 15 15 O TRUE
## 16 16 P FALSE
## 17 17 Q TRUE
## 18 18 R FALSE
## 19 19 S TRUE
## 20 20 T FALSE
pacman::p_load(tibble) # We will see later

x = rnorm(100)
y = rnorm(100)
z = rnorm(200)

df1 <- tibble(var1 = x, var2 = y)

# We could of course add columns that have text, or are contingent on preceding column values, e.g.:
df2 <- tibble(var1 = x, var2 = y, message = "Msg", contingent_Column = ifelse(var1 > 0, "Positive", "Negative"))

# Notice that dataframes should be balanced. The following will fail (check this yourself and argue why):
# df3 <- tibble(var1 = x, var2 = z)
You could of course also amend your existing dataframe to be a tibble - let’s see how this is done.
First, let’s make a quick digression and install a basic dataframe package from my github account to have illustrative data frames for the tutorials. The package is called fmxdat and can be found here.
pacman::p_install_gh("Nicktz/fmxdat", force = T)
# You could also use devtools for installing from github as follows:
# devtools::install_github("Nicktz/fmxdat")
Now you can use the data from the package using fmxdat::
Let’s see an example of a non-tibble dataframe (say read in from excel, or Bloomberg), and see how to make it a tibble (and why…)
pacman::p_load(tidyverse) # We will see later
pacman::p_load(tibble) # We will see later

df_ugly <- fmxdat::ugly_df
head(df_ugly)
## date ARS_Spot AUD_Spot BGN_Spot BRL_Spot CAD_Spot
## 1 2012-01-04 0.0005578023 -0.026521362 -0.0002646378 -0.021568523 -0.011421320
## 2 2012-01-11 0.0016260163 0.005722599 0.0186619019 -0.016696677 0.006813469
## 3 2012-01-18 0.0019248609 -0.012168248 -0.0121483791 -0.019477277 -0.008238525
## 4 2012-01-25 0.0038654723 -0.015098613 -0.0186110746 -0.001131862 -0.006823576
## 5 2012-02-01 -0.0006917224 -0.010088744 -0.0042216712 -0.017393768 -0.005675595
## 6 2012-02-08 0.0018458699 -0.008704510 -0.0074697174 -0.007265179 -0.002603645
## CLP_Spot CNY_Spot COP_Spot CZK_Spot DKK_Spot
## 1 -0.013487476 -0.0043657761 -0.029033123 0.002558032 6.962091e-05
## 2 -0.010117188 0.0033363519 -0.017545537 0.017935582 1.851787e-02
## 3 -0.018389172 -0.0005700442 -0.006488240 -0.025031086 -1.218345e-02
## 4 -0.013065327 0.0035331210 -0.015657143 -0.029182252 -1.873411e-02
## 5 -0.006924644 -0.0041521945 -0.006275121 -0.006672378 -4.248493e-03
## 6 -0.014663659 -0.0021085340 -0.011427745 -0.022085729 -7.754271e-03
## EUR_Spot GBP_Spot HUF_Spot INR_Spot JPY_Spot
## 1 -0.0001545237 -0.010307298 0.035157393 -0.001483403 -0.0156530665
## 2 0.0185724404 0.018983626 -0.011913416 -0.020434167 0.0016944734
## 3 -0.0121278084 -0.007060500 -0.030449176 -0.028472323 -0.0003903709
## 4 -0.0185411262 -0.014050326 -0.046665543 -0.006273926 0.0124967456
## 5 -0.0041790138 -0.011115321 -0.024143268 -0.016465497 -0.0203137053
## 6 -0.0074660633 0.001011506 -0.003262495 -0.002512491 0.0110236220
## KRW_Spot MXN_Spot MYR_Spot NOK_Spot NZD_Spot
## 1 -0.0060569352 -0.020589076 -0.0096351287 -0.013854782 -0.023866954
## 2 0.0089840689 -0.006664234 0.0007974482 0.016707890 -0.011668758
## 3 -0.0150040552 -0.022030025 -0.0060557769 -0.012428425 -0.009199403
## 4 -0.0137347477 -0.021782416 -0.0130511464 -0.019757017 -0.015181195
## 5 0.0003552556 -0.009731930 -0.0119565924 -0.007248359 -0.018740990
## 6 -0.0094286856 -0.013480915 -0.0123972378 -0.009884282 -0.002635993
## PEN_Spot PHP_Spot PLN_Spot RON_Spot RUB_Spot
## 1 0.0000000000 -0.005230244 0.021491151 0.009351151 0.0027140039
## 2 -0.0011125533 0.005600640 0.011483667 0.025402708 0.0007301682
## 3 -0.0005568962 -0.012843828 -0.039978375 -0.014514296 -0.0089694843
## 4 -0.0001857355 -0.007138745 -0.037434423 -0.019548694 -0.0252796598
## 5 -0.0013003901 -0.003942943 -0.019737652 -0.003438707 -0.0166271093
## 6 0.0001860119 -0.016882058 -0.008544055 -0.006447121 -0.0143787950
## SEK_Spot SGD_Spot THB_Spot TRY_Spot TWD_Spot
## 1 -0.010290925 -0.008548325 -0.002535658 -0.023050107 -0.0005610931
## 2 0.014799865 0.002796334 0.008166508 -0.012036643 -0.0103034906
## 3 -0.013748181 -0.010611929 0.001985690 -0.013477089 -0.0003003103
## 4 -0.014801935 -0.011900102 -0.006700220 -0.013114754 -0.0001668892
## 5 -0.001112364 -0.008874099 -0.018431137 -0.027796235 -0.0099482557
## 6 -0.014239261 -0.003597410 -0.007743184 -0.003986787 -0.0047880770
## ZAR_Spot
## 1 -0.011341787
## 2 -0.003852687
## 3 -0.016730715
## 4 -0.008507697
## 5 -0.025526629
## 6 -0.017688986
Notice that the above is ugly, vague and not useful in any way.
Making this data.frame a tibble nicely summarizes what is in the data.frame:
tibble::tibble(df_ugly)
## # A tibble: 294 x 32
## date ARS_Spot AUD_Spot BGN_Spot BRL_Spot CAD_Spot CLP_Spot CNY_Spot
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-01-04 0.000558 -0.0265 -0.000265 -0.0216 -0.0114 -0.0135 -0.00437
## 2 2012-01-11 0.00163 0.00572 0.0187 -0.0167 0.00681 -0.0101 0.00334
## 3 2012-01-18 0.00192 -0.0122 -0.0121 -0.0195 -0.00824 -0.0184 -0.000570
## 4 2012-01-25 0.00387 -0.0151 -0.0186 -0.00113 -0.00682 -0.0131 0.00353
## 5 2012-02-01 -0.000692 -0.0101 -0.00422 -0.0174 -0.00568 -0.00692 -0.00415
## 6 2012-02-08 0.00185 -0.00870 -0.00747 -0.00727 -0.00260 -0.0147 -0.00211
## 7 2012-02-15 0.00214 0.00954 0.0150 0.00424 0.00402 -0.00808 0.000906
## 8 2012-02-22 0.000506 0.00555 -0.0138 -0.0134 -0.000200 0 -0.000635
## 9 2012-02-29 0.000689 -0.00876 -0.00569 0.00698 -0.00990 0 -0.000365
## 10 2012-03-07 -0.00443 0.0141 0.0133 0.0282 0.00758 0.0196 0.00257
## # ... with 284 more rows, and 24 more variables: COP_Spot <dbl>,
## # CZK_Spot <dbl>, DKK_Spot <dbl>, EUR_Spot <dbl>, GBP_Spot <dbl>,
## # HUF_Spot <dbl>, INR_Spot <dbl>, JPY_Spot <dbl>, KRW_Spot <dbl>,
## # MXN_Spot <dbl>, MYR_Spot <dbl>, NOK_Spot <dbl>, NZD_Spot <dbl>,
## # PEN_Spot <dbl>, PHP_Spot <dbl>, PLN_Spot <dbl>, RON_Spot <dbl>,
## # RUB_Spot <dbl>, SEK_Spot <dbl>, SGD_Spot <dbl>, THB_Spot <dbl>,
## # TRY_Spot <dbl>, TWD_Spot <dbl>, ZAR_Spot <dbl>
Tibbles are accessible in various ways. The most common base way is:
tibble[ rows, columns]
df <- df_ugly %>% tibble::tibble()

# To isolate the first row, fifth column:
df[1, 5]

# To isolate the entire third row:
df[3, ]

# To remove the second column:
df[ , -2]

# Select the first four rows of columns 1, 4 and 7:
df[1:4, c(1,4,7)]
tibble$Column
To isolate a specific column, you could use $
head(df$ARS_Spot, 10)
select
We could also use the select verb from the dplyr package (we will get to using this more later):
library(dplyr)
select(df, ends_with("_Spot"))
select(df, starts_with("ARS"))
select(df, contains(c("BRL", "CAD")))
# Using a vector:
selectors <- c("date", "ARS_Spot", "AUD_Spot", "BGN_Spot")
select(df, all_of(selectors))
# Using a function:
select(df, where(is.numeric))
You will use these verbs quite often when coding. Make a thick-underlined-and-yellow-highlighted note of the following (grepl and gsub!):
String <- c("SOME STRING", "Another STRING", "Last One", "SOME Other One")
# gsub: replacement tool
gsub(x = String,
pattern = " STRING",
replacement = "")
## [1] "SOME" "Another" "Last One" "SOME Other One"
# grepl: identifying a pattern match
# Let's trim String to only include entries with the word "SOME"
grepl(pattern = "SOME", String)
## [1] TRUE FALSE FALSE TRUE
# Notice that grepl produces TRUE / FALSE - this is called a logical vector in R.
# We can use this to subset our string as follows:
String[ grepl(pattern = "SOME", String) ]
## [1] "SOME STRING" "SOME Other One"
Notice above that String is a vector, not a dataframe. Thus TRUE selects an entry, while FALSE causes it to be dropped.
If we wanted to use grepl to subset columns, this can be done as follows (Notice my use of | to signify or):
df <- fmxdat::ugly_df %>% tibble::as_tibble()
# Let's now subset our dataframe by selecting columns of df that contain a "Z" or a "Y", as well as the date column
df[ , grepl("date|Z|Y", colnames(df))]
## # A tibble: 294 x 8
## date CNY_Spot CZK_Spot JPY_Spot MYR_Spot NZD_Spot TRY_Spot ZAR_Spot
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-01-04 -0.00437 0.00256 -0.0157 -0.00964 -0.0239 -0.0231 -0.0113
## 2 2012-01-11 0.00334 0.0179 0.00169 0.000797 -0.0117 -0.0120 -0.00385
## 3 2012-01-18 -0.000570 -0.0250 -0.000390 -0.00606 -0.00920 -0.0135 -0.0167
## 4 2012-01-25 0.00353 -0.0292 0.0125 -0.0131 -0.0152 -0.0131 -0.00851
## 5 2012-02-01 -0.00415 -0.00667 -0.0203 -0.0120 -0.0187 -0.0278 -0.0255
## 6 2012-02-08 -0.00211 -0.0221 0.0110 -0.0124 -0.00264 -0.00399 -0.0177
## 7 2012-02-15 0.000906 0.0320 0.0180 0.0107 0.00204 0.0112 0.0278
## 8 2012-02-22 -0.000635 -0.0163 0.0237 -0.00320 0.00434 -0.00515 -0.00334
## 9 2012-02-29 -0.000365 -0.0161 0.0107 -0.0103 -0.00575 -0.00603 -0.0297
## 10 2012-03-07 0.00257 0.0103 -0.000739 0.0117 0.0216 0.0187 0.0120
## # ... with 284 more rows
# Very important to note we can have or in grepl by using | between the entries
In contrast to vectors or columns in a tibble (where all elements in a column must be of the same type), we can use lists.
List structures can combine objects with different classes, which helps in organizing your data in a better workable format.
Lists can contain anything (dataframes, plots, vectors, etc), and extracting information from them can be done as follows:
Smith <- list(name="John", title = "President", firm="Google", salary = 500000)

print(Smith)

# To access buckets in the list, we can use $ or [[]]:
Smith$title
# Which is equivalent to:
Smith[[2]] # double [[k]] implies "entry at k"

# Lists can also store characters and values:
# e.g. calculating the salary in 1000s:
Smith$salary / 1e+3
# Note - length tells us how many elements are in the list:
length(Smith)
Note from the above:
[[ ]] unboxes a list;
[ ] unboxes a data.frame or a single list element.
Not everything can be done to lists - note which of the following work and which error:
l <- list(a=TRUE, b=c("a","b"), c=1:100)
l[[1]] # works: first element of a list
l[[2]][1] # works: first entry of second element in list.
l[[c(1,2)]] # error - cannot return two objects, without a list to hold them
l[[-c(1,2)]] # error - negative indices cannot be used for list elements
l[[c(TRUE,FALSE,TRUE)]] # error - logical indices cannot be used
l[[c("a")]] # works: select by name
A list can even store dataframes and plots.
Work through the following code to understand more about subsetting, formatting and wrangling base dataframes:
pacman::p_load(tibble)
df_TRI <- fmxdat::DailyTRIs
# Let's quickly remove the ' SJ' from all column names first.
# This requires us using colnames and gsub:
colnames(df_TRI) <- gsub(" SJ", "", colnames(df_TRI))
# Let's get each column's max value (we will explain apply next):
Max_vals <- apply(df_TRI[,-1], 2, max, na.rm=T)
Max_vals
## ABI ABL ACL AEG AGL AMS ANG AOD
## 9195.00 -Inf 17929.22 -Inf 30743.92 1798.53 5378.38 9300.00
## APN ARI ASR AVG BAW BGA BIL BTI
## -Inf -Inf -Inf 1180.00 1244.47 1117.64 11307.96 -Inf
## BVT CCO CFR CRH CRN DDT DRD DSY
## 7937.50 -Inf 3650.00 573.32 473.11 403.00 3628.02 3199.00
## ECO EXX FSR GFI GRT HAR IMP INL
## 7900.00 7500.00 575.00 1718.86 -Inf 3075.83 3130.97 3800.00
## INP IPL ITU JDG JNC KIO LBH LGL
## 1383.73 9750.00 8321.64 -Inf 358.78 -Inf -Inf 7791.02
## LHC LON MDC MND MNP MPC MSM MTN
## -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## MUR NED NPK NPN NTC OML PIK PPC
## -Inf 1345.00 1614.00 5933.33 -Inf 1870.00 -Inf -Inf
## REI REM RLO RMH SAB SAP SBK SHF
## -Inf 1725.00 -Inf -Inf 1244.00 1390.00 -Inf -Inf
## SHP SLM SOL TBS TKG TRU VNF VOD
## -Inf 1300.00 13055.00 7791.02 4805.53 -Inf 2600.00 -Inf
## WHL
## 1085.43
# Let's replace the is.infinite values with zero next (max of a column of NAs gives -infinity):
Max_vals[is.infinite(Max_vals)] <- 0
# Go through the above line of code to understand what is cooking here..
Min_vals <- apply(df_TRI[,-1], 2, min, na.rm=T)
Min_vals[is.infinite(Min_vals)] <- 0

Result <- list()
Result$Name <- "Result of Max Returns"
Result$Max <- Max_vals
Result$Min <- Min_vals
Result$LastDate <- tail(df_TRI, 1) # Notice my use of tail here... Converse is head.
Result
Notice that the object now created contains strings, vectors and a dataframe.
To find the highest max, use:
Result$Max[ which( Result$Max == max(Result$Max) ) ]
## AGL
## 30743.92
The above is a nice line of code to learn from:
$ is used to subset a list by name (as opposed to position, which would be e.g. Result[[2]])
which is great for subsetting: it returns the positions in a logical vector that are TRUE, e.g. which( x == max(x) )
== means equals. You could also use e.g. !=, >=, <=
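A tiny example of which at work (the vector and its names are made up for illustration):

```r
Max <- c(AGL = 30743.92, ABI = 9195.00, ACL = 17929.22)

which(Max > 10000)            # positions (with names) where the condition is TRUE
Max[ which(Max == max(Max)) ] # subset down to the maximum entry
```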
A very useful set of functions used in base R are the Apply functions.
Apply returns a vector or array (or list) of values obtained by applying a function to parts of an array or matrix.
# Let's create a function for creating normal data:
# Note: function(inputs){commands}
NormalDataGen <- function(n, mu, sd){ rnorm(n, mean = mu, sd = sd) }

a <- NormalDataGen(n = 100, mu = 5, sd = 2)
b <- NormalDataGen(n = 100, mu = 3, sd = 2.3)
c <- NormalDataGen(n = 100, mu = 2, sd = 1)
d <- NormalDataGen(n = 100, mu = 4, sd = 4)
e <- NormalDataGen(n = 100, mu = 3, sd = 2)

data <- data.frame(a,b,c,d,e)
class(data)
# Now, using apply note the following:
# apply(data, 1 is row and 2 is columns,function to be done)
# Using this, to calculate the sum of all the columns, simply use:
apply(data,2,sum)
# Another SUPER USEFUL addition to this is that we can add commands
# that are part of the function. E.g. had we calculated column means,
# we could use "na.rm = TRUE"" (remove NAs) and "trim" to trim for outliers.
# Adding these commands to mean should not be done in brackets, but after commas
apply(data, 2, mean, na.rm = TRUE, trim=0.35)
# Note that na.rm and trim belong to the mean function, not apply...
How would you know what "trim" does in the code above? USE STACKOVERFLOW. E.g., google the following:
what does trim do in apply r stackoverflow
You will likely find this useful answer.
Let’s do something a bit more advanced. In our df_TRI dataframe, let’s select only columns that have no NA.
We use which to trim our columns, !is.na() to identify which entries are not NA (the ! always means not in R), and all to see if all entries in a column are not NA.
df_TRI[, which( apply( !is.na(df_TRI), 2, all) ) ]
## # A tibble: 501 x 30
## Date ACL AGL AMS ANG BAW BGA BIL BVT CFR EXX
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2003-01-02 13078 26924. 756. 2966. 838. 592. 10140. 6794. 2300 2350
## 2 2003-01-03 12975. 27523. 752. 2980. 838. 597. 10345. 6825 2320 2400
## 3 2003-01-06 12718. 28764. 737. 2985. 834. 593. 10879. 7062. 2305 2400
## 4 2003-01-07 12913. 27992. 752. 2990. 833. 605. 10423. 7188. 2310 2421
## 5 2003-01-08 12973. 28085. 742. 2961. 848. 604. 10345. 7312. 2272 2515
## 6 2003-01-09 12924. 28086. 747. 2978. 853. 590. 10354. 7488. 2260 2600
## 7 2003-01-10 12666. 28091. 747. 2998. 853. 597. 10521. 7075 2260 2530
## 8 2003-01-13 13335. 28563. 742. 2857. 861. 605. 10573. 7275 2259 2529
## 9 2003-01-14 13403. 29209. 743 2857. 864. 620. 10778. 7250 2260 2505
## 10 2003-01-15 13552. 28132 742. 2862. 856. 620. 10192. 7500 2280 2549
## # ... with 491 more rows, and 19 more variables: FSR <dbl>, GFI <dbl>,
## # HAR <dbl>, IMP <dbl>, INL <dbl>, INP <dbl>, IPL <dbl>, ITU <dbl>,
## # LGL <dbl>, NED <dbl>, NPK <dbl>, OML <dbl>, REM <dbl>, SAB <dbl>,
## # SAP <dbl>, SLM <dbl>, SOL <dbl>, TBS <dbl>, VNF <dbl>
That is a powerful line of code - make sure you understand it as it encompasses all we’ve done to now.
Note: There is a whole family of apply, sapply, mapply, etc. that we won’t go into. But check it out, it’s a powerful suite of commands…
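As a small taster of that family (a sketch only, since these notes don't cover it further): sapply applies a function to each element and simplifies the result to a vector where possible, while lapply always returns a list.

```r
x <- 1:5

sapply(x, function(i) i^2) # simplified to a numeric vector: 1 4 9 16 25
lapply(x, function(i) i^2) # the same result, but returned as a list

# sapply is convenient for quick summaries over a list of vectors:
sapply(list(a = 1:10, b = 21:30), mean)
```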
The logical operators and and or can be useful in coding. They are written as & and | respectively.
x <- tibble(RandomData = rnorm(100,50,2))
# Let's only keep data that lie between 48 and 50 (for whatever reason...)
z <- x$RandomData[ which( x$RandomData < 50 & x$RandomData > 48) ]

#==== Note:
# x$RandomData is the column
# using square brackets, like : x$RandomData[1:5], gives me e.g. the first five positions
# "which" gives me the positions that satisfy it being <50 & > 48... so:
# x[ which( x$RandomData <50 & x$RandomData >48) ] thus does the job!
Let’s look at the workhorse of programming: Loops.
Loops allow you to tell R to repeat something over and over again. See this example:
x <- rnorm(1, mean = 100, sd = 10)
# If statement:
if (x < 100){
  m <- "Heads!"
} else {
  m <- "Tails!"
}
print(m)
## [1] "Heads!"
# Note that if you repeat the above statement - it is the same as tossing coins...
Now such statements of logic often appear in loops to guide R to do what you want it to do. E.g.:
sqr <- c() # Create an open vector to be filled:
x <- seq(from = 1, to = 100, by = 10)

# Loop to square every entry of x:
for(i in 1:length(x)){
  sqr[i] <- x[i]^2
}
print(sqr)
while can be used to repeat a routine until an objective is reached.
E.g.: we create a dataframe x, and multiply its entries by more random data, until the maximum of the original column exceeds 3.
x <- tibble(Random = rnorm(100))

while( max(x$Random) < 3){
  x <- bind_cols(x, Another_Random = rnorm(100, 1, 0.1))
  x$Random <- x$Random * x$Another_Random
  x <- x[,-2]
  print(max(x$Random))
}
while loops are invaluable in certain applications - e.g. I recently built a function that proportionally reallocates weights from stocks with weights > 10%, until there is no stock above 10% (after each reallocation it is possible that another stock now exceeds the 10% threshold, requiring further constraining and reallocation).
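A stripped-down sketch of such a capping routine (purely illustrative: the weights, the 25% cap and the variable names are my own, not the actual function):

```r
weights <- c(0.4, 0.3, 0.2, 0.1) # portfolio weights, summing to 1
cap <- 0.25

while( max(weights) > cap ){
  excess <- sum( weights[weights > cap] - cap ) # total weight above the cap
  weights[weights > cap] <- cap                 # cap the offenders
  below <- weights < cap
  # reallocate the excess proportionally to the uncapped stocks:
  weights[below] <- weights[below] + excess * weights[below] / sum(weights[below])
}

weights # no entry exceeds the cap, and the weights still sum to 1
```

The reallocation itself can push another stock over the cap, which is exactly why a while loop (rather than a single pass) is needed.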
Side note: paste is powerful, as it pastes together text. Compare paste(…, sep = "/") and paste0(…). Replace … with text strings, e.g.:
t <- "text1"
t2 <- "text2"
paste(t, t2, sep = "/")
paste0(t, t2) # No need for sep, implied that sep = ""
Even better than paste is to use glue.
Install the package first (and also another one we’ll use later: lubridate):
pacman::p_load(lubridate, glue)
It is now even easier to paste, see this example:
date <- lubridate::today()
randomValue <- rnorm(1)
Mood <- ifelse(randomValue > 0, "happy", "sad")

print(
  glue::glue("
\n=======\n
This is an example of pasting strings in text.
On this day, {date}, my mood is {Mood}, as my value is {ifelse(randomValue>0, 'positive', 'negative')}.
That's pretty cool right? ...note here will now follow\n\n a few spaces...
\n=======\n
")
)
##
## =======
##
## This is an example of pasting strings in text.
## On this day, 2022-02-07, my mood is sad, as my value is negative.
## That's pretty cool right? ...note here will now follow
##
## a few spaces...
##
## =======
Notice my use of ifelse above. It follows the same notation as in Excel. We will use if, or, while, etc. a lot in our coding.
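One difference from Excel worth noting: ifelse is vectorized, so it evaluates its condition element by element (handy for building columns), while if only looks at a single TRUE/FALSE:

```r
x <- c(-2, 0.5, 3, -1)

ifelse(x > 0, "Positive", "Negative")
# [1] "Negative" "Positive" "Positive" "Negative"

# if(), by contrast, is for control flow on a single condition:
if (x[1] > 0) print("Positive") else print("Negative")
```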
rmsfuns is a wrapper package that offers several convenient wrappers and commands that we will often be using in this course. See the details of this package here and the source code here.
Install the rmsfuns package as follows:
if (!require("rmsfuns")) install.packages("rmsfuns")
library(tidyverse)
library(rmsfuns)
dates1 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "alldays")
dates2 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdays")
dates3 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "calendarEOM") # Calendar end of month
dates4 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOW") # weekday end of week
dates5 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOM") # weekday end of month
dates6 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOQ") # weekday end of quarter
dates7 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOY") # weekday end of year
Let’s create a random data frame, with a date column and random returns:
datescol <- dateconverter(as.Date("2015-01-01"), as.Date("2017-05-01"), "weekdays")

# This creates a vector of random data with a mean of 0.01 and sd of 0.03:
RandomReturns <- rnorm(n = length(datescol), mean = 0.01, sd = 0.03)

df <- data.frame( Date = datescol, Return = RandomReturns)
# Let's now quickly view this in excel..
ViewXL(df)
Notice that the random file location is intentionally printed when the function is run - so that you can delete the file after viewing it (so as not to clutter your pc).
Sometimes when running long calculations, you may find that you want to time your calculations.
While there are many ways to skin this cat, it might be easiest to make your prompt (what you see at the bottom of your RStudio screen) show the current time. You can then track through your estimation how long each step has taken (explicitly saving time stamps could be done using Sys.time()):
PromptAsTime(TRUE)
x <- 100
Sys.sleep(3)
x*x
print(x)
PromptAsTime(FALSE)
Now you can see, after the fact, that x*x took 3 seconds (not really - I purposefully made it lag, but you get the point!).
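Explicitly saving time stamps with Sys.time(), as mentioned above, could look like this (toy example):

```r
# Record the start time, do some work, then take the difference:
start <- Sys.time()
Sys.sleep(1)                 # stand-in for a long calculation
elapsed <- Sys.time() - start
print(elapsed)               # a difftime object, roughly 1 second here
```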
Building folder path structures in R can be challenging, but build_path makes this easy. This could, e.g., be used in a function that creates folders and populates them with figures / data points. E.g., suppose you want to create a folder structure as follows:
To create the folders to house the plots (and save their locations in a vector to use it later), run:
# Specify a root on your computer, e.g.:
Root <- "C:/Finmetrics/Practical2/Folder_Example"
# Specify the sectors:
Sectors <- c("Financials", "Industrials", "HealthCare")
# Construct the structure and bind them together:
# base R's paste0:
Locs <- build_path( paste0(Root, "/", Sectors, "/Figures/") )
# glue's glue... (easier to use - see later my summary of this call):
Locs <- build_path( glue::glue("{Root}/{Sectors}/Figures/") )
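For context, here is a base-R sketch of roughly what build_path does (using a temporary directory instead of C:/ so it runs anywhere):

```r
# Build the nested folder paths and create them recursively:
Root <- file.path(tempdir(), "Folder_Example")
Sectors <- c("Financials", "Industrials", "HealthCare")
Locs <- file.path(Root, Sectors, "Figures")
invisible(lapply(Locs, dir.create, recursive = TRUE, showWarnings = FALSE))
dir.exists(Locs)
## [1] TRUE TRUE TRUE
```

build_path does essentially this, while also returning the vector of created locations for later use.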
It is important to first, before anything else, save your work in a directory where you will be able to locate it at a later stage. I strongly suggest, though, always working from an .Rproj file.
In fact, you should always work from an .Rproj file. Always.
Step 1: Create a Folder within which to store your work.
Create an .Rproj file for the folder (from RStudio: File / New Project). ALWAYS work within this project environment - it makes managing your work MUCH easier.
RProjects are awesome because:
+ It sets the working directory to the folder
+ It keeps your working changes, which is useful if, e.g., you forgot to save
+ It allows you to easily navigate through your child folders.
This should be standard for when starting any project.
I show you below first how to do this by hand, and second how to use fmxdat's automatic way (much quicker for setting up a good folder environment. Please check both):
You can do this as follows.
1.) Create a directory on your PC where you want the project to reside (spend time thinking about intuitive directory structures).
2.) Highlight the directory address and copy it (Ctrl + C)
3.) From within any R studio instance, simply run:
fmxdat::make_project(Open = T)
This gives you a structure resembling the following (and if you set Open = T, it will open this folder as well):
On the bottom right panel - open the README.Rmd file.
This README file will be your diary as your project develops.
When opening your README.Rmd file after opening your newly created .Rproj file, your screen in Rstudio will look as follows:
Your README.Rmd file is split into a text part (for a title use a hash, for a subtitle a double hash, etc.) and a code part (the shaded chunks that open with ```{r} and close with ```).
Type what you are doing (purpose and steps), while placing your code to be executed in the code chunks (created by pressing Ctrl + Alt + i)
The code chunks then separate code from text.
You can then create htmls of your README (by pressing Ctrl + Shift + k, try it: how cool is that? We can discuss knitting htmls and pdfs at a later stage) - which is great for documentation purposes.
As an example, see the Texevier package’s README here which we will use later (it is a public package so you can view the repository directly): https://github.com/Nicktz/Texevier
Notice that GitHub automatically transforms the README.Rmd file into an html landing page below the code - imagine if all your projects looked like this; think how that would increase your productivity!
After you’ve worked on your project, your README.Rmd diary becomes your manual.
Next time you open your project - be sure to open it by double clicking on the .RProj file
There are very many ways to get data into R (openxlsx, readr, readxl, etc.).
Which you use really mostly depends on what data source you are loading (and size). There are also several new platforms being developed for loading and saving data. Below follows a summary of how I currently prefer to load data (although this will likely change as a more stable version of feather is launched).
If you are unsure of how to load data, you might find this blog useful. It provides you with the code to load a dataset saved in a range of formats.
I prefer storing raw data that should be human readable (e.g. settings files) in csv format, and then using the readr package for loading data. For data you want to access later from R (i.e. saving R dataframes) I prefer using rds file storage (as one can set the compression easily - useful for BIG datasets).
For this, the readr package can be used (loaded as part of library(tidyverse) ).
Do the following to understand the process on your side:
pacman::p_load(tidyverse)
dta <- fmxdat::BRICSTRI
# Create a folder called Data in your root.
# If you followed my gif's steps above - you are currently in a .Rproj environment,
# meaning you can straight-up use:
dir.create("Data/")
# And now let's store our data file in here:
write_rds(x = dta, path = "Data/Example_File.rds")
# And now to load it:
df <- read_rds(path = "Data/Example_File.rds")
# Note: in readr >= 1.4, the path argument has been renamed to file.
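The readr calls above wrap base R's rds machinery; here is a base-R sketch of the same compressed storage (the file name and toy data are my own):

```r
# Save a data frame as a compressed rds file and read it back unchanged:
df <- data.frame(Date = as.Date("2022-02-07") + 0:4, Return = rnorm(5))
f <- tempfile(fileext = ".rds")
saveRDS(df, f, compress = "xz")  # "gzip", "bzip2" or "xz"; useful for BIG datasets
identical(readRDS(f), df)
## [1] TRUE
```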
In addition to easily and more intuitively loading your data and better predicting the column types, readr allows the user to specify the types of columns you are loading (and even set defaults).
E.g., if I know my data’s first column is a date column, then most of my columns are character columns, except for a few exceptions, I can specify the loading to be as follows (see here for a Stackoverflow Q and A on this):
data <- read_csv("/Data/Indexes.csv", col_types = cols(.default = "d", Date = "D"))
# Above, d stands for dbl (used for numerical columns)
# If you want to specify that a column contains characters, use "c"
For those interested, to see what other column types you get view this vignette.
Returning back to our loaded data, we can view it as follows:
View(data)
head(data, 10) # Top 10
tail(data, 10) # Bottom 10
colSums(is.na(data))
# na.rm = TRUE tells R to count all the non-NA items in each column:
colSums(data == 0, na.rm = TRUE)
colSums(data >= 500, na.rm = TRUE)
# Remove all rows containing NAs:
dataNoNA <- na.omit(data)
# Or, alternatively, replace all NAs with zero:
data[is.na(data)] <- 0
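To make the counting logic concrete, here is a toy illustration (made-up data) of the NA and threshold counts used above:

```r
# Two columns with a mix of NAs, zeros and a large value:
toy <- data.frame(a = c(1, NA, 0), b = c(NA, NA, 500))
colSums(is.na(toy))               # NAs per column: a = 1, b = 2
colSums(toy == 0, na.rm = TRUE)   # zeros per column: a = 1, b = 0
colSums(toy >= 500, na.rm = TRUE) # values >= 500 per column: a = 0, b = 1
```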
Dates in R can be converted to a formal date column using several commands.
A simple way of doing this is to use the as.Date() command. See this page for more details and examples. To create a date column, we could e.g. use:
data.frame(date = seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
Suppose you imported an excel file with data, and R failed to read your date column as a valid date column, you could tell it that it is a date column using:
df$Date <- as.Date(df$Date, format="%d/%m/%Y")
Where Date is your date column, and the format is day, month and year, separated by a forward slash. Change the above to suit your example if you run into problems.
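For example, with hypothetical character dates in this day/month/year format:

```r
# Character dates as Excel might hand them to you:
Date_chr <- c("07/02/2022", "15/03/2022")
as.Date(Date_chr, format = "%d/%m/%Y")
## [1] "2022-02-07" "2022-03-15"
```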
We will have more date setting examples later in the course.
Suppose you imported a Total Return Index series, with daily dates, but you wanted to only focus on weekdays, or Mondays, or even specific months (e.g. to study the January trading effect). This could be done as follows.
Let’s first create some fake data for illustrative purposes:
pacman::p_load(dplyr)
df <-
  tibble(
    date = seq(as.Date("2012-01-01"), as.Date("2015-08-18"), "day"),
    TRI = rnorm(1326, 0.05, 0.04))
# Let's now create a function that only focuses on weekdays:
dow <- function(x) format(as.Date(x), "%a")
df$day <- dow(df$date)
df[!grepl("Sat|Sun", df$day), ]

# To focus only on particular months, let's say January and February:
dom <- function(x) format(as.Date(x), "%b")
df$Month <- dom(df$date)
df[grepl("Jan|Feb", df$Month), ]
# Note: Use the above only after loading package dplyr,
# and setting your data.frame in tibble::as_tibble format.
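One caveat: the "%a" and "%b" labels depend on your locale. A locale-independent sketch (my own toy data) uses POSIXlt's numeric weekday instead:

```r
# POSIXlt's wday slot codes weekdays numerically: 0 = Sunday ... 6 = Saturday.
df2 <- data.frame(date = seq(as.Date("2012-01-01"), as.Date("2012-01-14"), "day"))
df2$wd <- as.POSIXlt(df2$date)$wday
df2[df2$wd %in% 1:5, ]   # weekdays only: 10 of the 14 days remain
```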
Functions are the absolute building block of coding in R. It is always recommended to break your workflow down into separate functions, each specializing in one thing, creating a suite of functions that makes up your analysis.
Think of functions in R as the cells that make up your body: small, squishy and important for life.
Below is a simple function to illustrate the notation. In the next session we will elaborate much more:
x = rnorm(1000)
# Let's now create a function that returns the maximum value of a vector:
Max_Value <- function(vector) {

  Maximum <- max(vector)

  return(Maximum)
}
Max_Value(vector = x)
## [1] 3.390371
# Notice that Maximum was not saved anywhere as the function's environment was temporary:
exists("Maximum")
## [1] FALSE
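Functions can also take default arguments, which we will use often. A toy sketch (all names illustrative):

```r
# factor defaults to 100 unless the caller overrides it:
Scale_Returns <- function(x, factor = 100) {
  x * factor
}
Scale_Returns(0.05)              # uses the default factor of 100
## [1] 5
Scale_Returns(0.05, factor = 10) # overrides the default
## [1] 0.5
```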
We will discuss in our next session how to structure your workflow and manage your projects - motivating the use of functional programming, and coding in a structured & replicable way.
Next time, we will be discussing functions and project workflow in R.
This will be a bit of a philosophical session, but it is important for understanding how best to structure and plan your projects (and life) in R.