The aim of this tutorial is to introduce you to R and to data exploration. Note that the practicals in this short course are designed for you to copy the code and run all the estimations yourself in R, to familiarize yourself with the program.
The idea is to have code chunks, with real data - to ensure you don’t become this cat:
Let’s first get you set up.
Make sure you have the following installed on your computer:
Select the .exe download link from the table that corresponds to your version of R. Please note: if you're not sure what version of R you have, open or restart R and type in the console:
sessionInfo()[1]$R.version$version.string
Note that when running Rtools.exe it is important not to simply click through the installer options: be careful to check the box that lets the installer edit your PATH.
In your console, run the following code:
install.packages("devtools")
Last thing: check that the following lines of code produce TRUE in your console:
library(devtools)
find_rtools()
Once you have downloaded RStudio and opened it up, select: File / New File / R Script.
I suggest that you work with the notes that follow in RStudio, copying and pasting all the code into an R script, saving it, and making sure it works and that you understand the process behind the coding.
Also, remember, the following are your friends:
Stack Overflow - the Google of R.
But first, I suggest printing out and keeping this post with you as you work through this tutorial (ignore the base R plotting section…).
Periodically, you might have to install a newer version of R. Luckily this takes only a few minutes, and happens automatically if you do the following:
# installing/loading the package:
if(!require(installr)) { install.packages("installr"); require(installr)}
check.for.updates.R() # tells you if there is a new version of R or not.
install.R() # download and run the latest R Version. Awesome.
copy.packages.between.libraries()
# copy your packages to the newest R installation from the one version before it.
# (if ask=T, it will ask you between which two versions to perform the copying)
I will illustrate how R basically thinks, and how you should "think" in R, with an example. Note this is what we call base R coding. Other, more optimized packages like dplyr and the tidyverse variants that we will be using for our data analytics have different notations. You should, however, be able to understand base R, as you will often use it in your daily R lives as well.
Please don’t be discouraged by this session if you find it cumbersome or boring - the tidyverse way of doing things that we will get to later is infinitely superior.
The following are common types of variables in R:
TRUE # logical. Useful in many instances when evaluating logical phrases.
1L # integer
1 # double (numeric)
1+0i # complex
"one" # character
Some types have special meanings:
NaN # double "not a number" (try 0/0)
Inf # double "infinity" (try 10 / 0)
-Inf # double "negative infinity" (try -10 / 0)
NA # logical "missing" value
# You can also use explicit labelling of NA:
NA_integer_ # integer "missing" value
NA_real_ # double "missing" value
NA_complex_ # complex "missing" value
NA_character_ # character "missing" value
NULL # special variable without a type
Some of the common operators we will be using include:
+ # addition
- # subtraction
/ # division
* # multiplication
^ # power
! # negation e.g.: if ( x != 10) ... which means not equal
& # logical and e.g.: if (x > 5 & x < 10) {print("medium range")}
| # logical or e.g.: if (x > 10 | x < 5) {print("outside medium range")}
== # equals e.g.: if (x == 10 | x == 5) {print("on boundary")}
!= # not equals (as above)
> # greater than
>= # greater than or equal
< # less than
<= # less than or equal
%in% # Checking if something is in something else
# As an illustration of the above, run some examples:
x <- 10
if( is.numeric(x) ) print("x is a number!")

# How to use %in% and not (!) %in%
vec <- c("A", "B", "C")
x <- "D"
if( !x %in% vec ) print("x is not in the vec!")
# Notice for %in%, we don't use !%in%
Variable names must start with a letter or a dot, and may contain only letters, numbers, dots and underscores. You can use arrows or = to define a variable:
# Same thing:
x <- 10
x = 10
10 -> x

# Won't work (must start with a letter):
2x <- 10
Almost every object in R is vectorized - it can contain multiple elements and has an attribute called “length.”
length(NaN)
length(1)
length("Words are meaningless") # a string is not a vector, but an element - with length 1
# Purely for illustration - the length of words:
length( strsplit("Words are meaningless", " ")[[1]] )
c() can be used for combining multi-element vectors too:
vec1 <- c(1, 2)
vec2 <- c(2, 3)
c(vec1, vec2)
Functions work with vectors too, and provide predictable output:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 3, 1, 5)
mean(x)
sum(y)
log10(x)
cor(x, y)
paste(x, y)
# Vectors can also have names:
x <- c(a=1, b=2, c=3, d=4)
names(x)
A vector can be treated as an array of elements, where we might want to subset it by selecting only a part of those elements. In R this is performed using the [ operator:
x <- c(10, 20, 30, 40)
x[4]
x[c(1,3)] # choose element 1 and 3
x[6] # Gives NA as there's no 6th element
x[-3] # Removes 3rd element
Elements can be selected by specifying TRUE if the element should be returned, and FALSE otherwise:
x <- c(10, 20, 30, 40, NA, 50)
x[c(TRUE, FALSE, FALSE, FALSE)]
## [1] 10 NA
x[c(FALSE, TRUE)] # This is termed recycling, as it will repeat F and T - see output (selects 2, 4 and 6).
## [1] 20 40 50
# The above is useful for logical operations:
# Note the output of:
is.na(x)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
# Now we can use it to replace NA in x as (using the logical operators):
x[is.na(x)] <- 1e6 # 1 with 6 zeros, or a million
x
## [1] 1e+01 2e+01 3e+01 4e+01 1e+06 5e+01
# Advanced example: (read carefully)
# Another useful example is simple winsorising (rnorm creates random data):
set.seed(123) # Set the randomness seed so that you get the same random data as I do:
# Create 100 random data points with a N(0,1) distribution:
random_normal <- rnorm(n = 100, mean = 0, sd = 1)

# Let's now winsorise this vector by replacing values below the first quartile with the value of the first quartile:
Quartile_level <- quantile(random_normal, probs = 0.25)

random_normal[random_normal < Quartile_level] <- Quartile_level
random_normal
# Notice that exactly a quarter of the values are now equal to -0.4938:
sum( random_normal == Quartile_level )
## [1] 25
R uses columns and arrays to define data frames. These can be adjusted (as will be seen) to tell R whether your data is a time-series, panel, or whichever format is intended.
Type the following code to create a set in R: (Remember: R is case sensitive!)
R <- c("Very Happy", "Happy", "Not Happy")
# Let's now create responses:
W <- c(15,5,3)
M <- c(35,15,14)
C <- c(23,35,32)
# Now we have many variables assigned names, but we now want to concatenate it all...
# i.e. let's merge the columns together in a single data.frame (as a single unit)
# To change the column names, simply type the name first:
HappySurvey <- data.frame(Responses = R, Women = W, Men = M, Children = C)
# Now to isolate a column, say Men, and count the responses, use the $ sign:
sum(HappySurvey$Men)
# Note the following syntax, these two are the same:
x <- HappySurvey$Men
x <- HappySurvey[,3] # calling all rows of column 3

# Other useful base R commands include:
mean(x)
min(x)
median(x)
summary(x)
# To do, say, a Chi-square test:
chisq.test(HappySurvey$Men)
Congratulations, you're up and running in R!
May there be many more happy coding hours!
Libraries are essential to using R. They are collections of functions that are neatly wrapped into a documented framework.
Let’s take a standard already installed library: stats
To see what a library has to offer, type the following in RStudio, followed by CTRL + Spacebar (try it):
stats::
If you are using a package or base R functions, and you do not know what the inputs are: do the following:
?stats::acf
?acf

# to see a function's code:
stats:::acf
To execute a function - you have to end it with brackets. Most of the time, a function requires inputs. These are given in the brackets themselves (as below).
To know what to input, simply go ?stats::acf to see documentation.
To just scan the parameters needed in a function, type stats::acf(), put your cursor in the brackets, and press CTRL + Space. This makes it super easy to find functions in packages and the parameters they take…
Try it yourself - as explained in the gif, rnorm creates random normal variables.
stats::acf(x = rnorm(100))
Once you have identified a package online that you would want to use, you first have to download it to your computer (R knows where to save it).
A package could either be on a peer-reviewed and verified platform (e.g. CRAN), or on someone's GitHub page (unverified).
Let’s quickly install xts - a very nice time-series package we’ll be using quite a bit.
install.packages("xts")
# Once the package is installed on your computer, you can call it using the following command:
library(xts)
The package, xts, has now been installed on your computer using install.packages. You never have to do this again (only for updates). But note that each time you open R, you have to tell it which packages to load using library. Having typed the above, xts is now active for the whole session, until you close R.
You could add the following in code so as to install it only if it has not yet been installed:
# As you have already now installed dplyr, it shouldn't now install again.
# Please use this in your code:
if(!require(dplyr)) install.packages("dplyr") # This installs dplyr if you don't yet have it...
library(dplyr)
Pacman, or package manager, could also be used.
This package allows you to load and install packages easily and intuitively.
if(!require(pacman)) install.packages("pacman") # This installs pacman if you don't yet have it...
# Using p_load, pacman will try to load a package, or try install it if you have not already installed it:
pacman::p_load(SomePackage)
Other functionality includes:
Ray <- array(c(1:100), dim=c(2,5))
# Note what happened in your RStudio console when executing this command.
#Let's give the data row and column names as follows:
colnames(Ray) <-c("Men","boys","women","children","babies")
rownames(Ray) <- c("satisfaction","Communication")
# ... By the way, I have no idea what the above names imply...
We won’t be using this often, but just to note.
Matrices are vectors with an additional attribute of dimension.
We can use rbind (row bind) and cbind (column bind) to add together vectors or arrays.
A matrix in R is thus merely a vector with two additional attributes: number of rows (nrow(X)) and number of columns (ncol(X)). Consider, e.g., the following matrix construction:
xx <- seq(from = 1, to = 16, by = 3)
# Note what seq() does...
yy <- seq(from = 1, to = 100, by = 18)

# Now let's merge these into a single matrix:
mat <- matrix(cbind(xx,yy), nrow=6, ncol=2)
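To see rbind and cbind directly, here is a small sketch (the vectors are made up for illustration):

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

cbind(a, b) # binds the vectors as columns: a 3x2 matrix
rbind(a, b) # binds the vectors as rows: a 2x3 matrix

# The dimension attribute confirms this:
dim(cbind(a, b)) # 3 2
dim(rbind(a, b)) # 2 3
```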
Data frames allow us to view data more effectively and manage our analytics process more easily. Note that it can be slower in long estimation procedures, where arrays / matrices could be much quicker to use.
I strongly recommend getting used to using tibbles as default opposed to data.frames (see this link for motivation).
Data frames in R require each column to be of a single type (e.g. numeric, character), and the columns combined must have the same number of rows.
df <- data.frame(id=1:20, name = LETTERS[1:20], state = rep(c(TRUE,FALSE), 10))

df
## id name state
## 1 1 A TRUE
## 2 2 B FALSE
## 3 3 C TRUE
## 4 4 D FALSE
## 5 5 E TRUE
## 6 6 F FALSE
## 7 7 G TRUE
## 8 8 H FALSE
## 9 9 I TRUE
## 10 10 J FALSE
## 11 11 K TRUE
## 12 12 L FALSE
## 13 13 M TRUE
## 14 14 N FALSE
## 15 15 O TRUE
## 16 16 P FALSE
## 17 17 Q TRUE
## 18 18 R FALSE
## 19 19 S TRUE
## 20 20 T FALSE
pacman::p_load(tibble) # We will see later

x = rnorm(100)
y = rnorm(100)
z = rnorm(200)

df1 <- tibble(var1 = x, var2 = y)

# We could of course add columns that have text, or are contingent on preceding column values, e.g.:
df2 <- tibble(var1 = x, var2 = y, message = "Msg", contingent_Column = ifelse(var1 > 0, "Positive", "Negative"))

# Notice that dataframes should be balanced. The following will fail (check this yourself and argue why):
# df3 <- tibble(var1 = x, var2 = z)
You could of course also amend your existing dataframe to be a tibble - let’s see how this is done.
First, let’s make a quick digression and install a basic dataframe package from my github account to have illustrative data frames for the tutorials. The package is called fmxdat and can be found here.
pacman::p_install_gh("Nicktz/fmxdat", force = T)
# You could also use devtools for installing from github as follows:
# devtools::install_github("Nicktz/fmxdat")
Now you can use the data from the package using fmxdat::
Let’s see an example of a non-tibble dataframe (say read in from excel, or Bloomberg), and see how to make it a tibble (and why…)
pacman::p_load(tidyverse) # We will see later
pacman::p_load(tibble) # We will see later

df_ugly <- fmxdat::ugly_df
head(df_ugly)
## date ARS_Spot AUD_Spot BGN_Spot BRL_Spot CAD_Spot
## 1 2012-01-04 0.0005578023 -0.026521362 -0.0002646378 -0.021568523 -0.011421320
## 2 2012-01-11 0.0016260163 0.005722599 0.0186619019 -0.016696677 0.006813469
## 3 2012-01-18 0.0019248609 -0.012168248 -0.0121483791 -0.019477277 -0.008238525
## 4 2012-01-25 0.0038654723 -0.015098613 -0.0186110746 -0.001131862 -0.006823576
## 5 2012-02-01 -0.0006917224 -0.010088744 -0.0042216712 -0.017393768 -0.005675595
## 6 2012-02-08 0.0018458699 -0.008704510 -0.0074697174 -0.007265179 -0.002603645
## CLP_Spot CNY_Spot COP_Spot CZK_Spot DKK_Spot
## 1 -0.013487476 -0.0043657761 -0.029033123 0.002558032 6.962091e-05
## 2 -0.010117188 0.0033363519 -0.017545537 0.017935582 1.851787e-02
## 3 -0.018389172 -0.0005700442 -0.006488240 -0.025031086 -1.218345e-02
## 4 -0.013065327 0.0035331210 -0.015657143 -0.029182252 -1.873411e-02
## 5 -0.006924644 -0.0041521945 -0.006275121 -0.006672378 -4.248493e-03
## 6 -0.014663659 -0.0021085340 -0.011427745 -0.022085729 -7.754271e-03
## EUR_Spot GBP_Spot HUF_Spot INR_Spot JPY_Spot
## 1 -0.0001545237 -0.010307298 0.035157393 -0.001483403 -0.0156530665
## 2 0.0185724404 0.018983626 -0.011913416 -0.020434167 0.0016944734
## 3 -0.0121278084 -0.007060500 -0.030449176 -0.028472323 -0.0003903709
## 4 -0.0185411262 -0.014050326 -0.046665543 -0.006273926 0.0124967456
## 5 -0.0041790138 -0.011115321 -0.024143268 -0.016465497 -0.0203137053
## 6 -0.0074660633 0.001011506 -0.003262495 -0.002512491 0.0110236220
## KRW_Spot MXN_Spot MYR_Spot NOK_Spot NZD_Spot
## 1 -0.0060569352 -0.020589076 -0.0096351287 -0.013854782 -0.023866954
## 2 0.0089840689 -0.006664234 0.0007974482 0.016707890 -0.011668758
## 3 -0.0150040552 -0.022030025 -0.0060557769 -0.012428425 -0.009199403
## 4 -0.0137347477 -0.021782416 -0.0130511464 -0.019757017 -0.015181195
## 5 0.0003552556 -0.009731930 -0.0119565924 -0.007248359 -0.018740990
## 6 -0.0094286856 -0.013480915 -0.0123972378 -0.009884282 -0.002635993
## PEN_Spot PHP_Spot PLN_Spot RON_Spot RUB_Spot
## 1 0.0000000000 -0.005230244 0.021491151 0.009351151 0.0027140039
## 2 -0.0011125533 0.005600640 0.011483667 0.025402708 0.0007301682
## 3 -0.0005568962 -0.012843828 -0.039978375 -0.014514296 -0.0089694843
## 4 -0.0001857355 -0.007138745 -0.037434423 -0.019548694 -0.0252796598
## 5 -0.0013003901 -0.003942943 -0.019737652 -0.003438707 -0.0166271093
## 6 0.0001860119 -0.016882058 -0.008544055 -0.006447121 -0.0143787950
## SEK_Spot SGD_Spot THB_Spot TRY_Spot TWD_Spot
## 1 -0.010290925 -0.008548325 -0.002535658 -0.023050107 -0.0005610931
## 2 0.014799865 0.002796334 0.008166508 -0.012036643 -0.0103034906
## 3 -0.013748181 -0.010611929 0.001985690 -0.013477089 -0.0003003103
## 4 -0.014801935 -0.011900102 -0.006700220 -0.013114754 -0.0001668892
## 5 -0.001112364 -0.008874099 -0.018431137 -0.027796235 -0.0099482557
## 6 -0.014239261 -0.003597410 -0.007743184 -0.003986787 -0.0047880770
## ZAR_Spot
## 1 -0.011341787
## 2 -0.003852687
## 3 -0.016730715
## 4 -0.008507697
## 5 -0.025526629
## 6 -0.017688986
Notice that the above is ugly, vague and not useful in any way.
Making this data.frame a tibble nicely summarizes what is in the data.frame:
tibble::tibble(df_ugly)
## # A tibble: 294 x 32
## date ARS_Spot AUD_Spot BGN_Spot BRL_Spot CAD_Spot CLP_Spot CNY_Spot
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-01-04 0.000558 -0.0265 -0.000265 -0.0216 -0.0114 -0.0135 -0.00437
## 2 2012-01-11 0.00163 0.00572 0.0187 -0.0167 0.00681 -0.0101 0.00334
## 3 2012-01-18 0.00192 -0.0122 -0.0121 -0.0195 -0.00824 -0.0184 -0.000570
## 4 2012-01-25 0.00387 -0.0151 -0.0186 -0.00113 -0.00682 -0.0131 0.00353
## 5 2012-02-01 -0.000692 -0.0101 -0.00422 -0.0174 -0.00568 -0.00692 -0.00415
## 6 2012-02-08 0.00185 -0.00870 -0.00747 -0.00727 -0.00260 -0.0147 -0.00211
## 7 2012-02-15 0.00214 0.00954 0.0150 0.00424 0.00402 -0.00808 0.000906
## 8 2012-02-22 0.000506 0.00555 -0.0138 -0.0134 -0.000200 0 -0.000635
## 9 2012-02-29 0.000689 -0.00876 -0.00569 0.00698 -0.00990 0 -0.000365
## 10 2012-03-07 -0.00443 0.0141 0.0133 0.0282 0.00758 0.0196 0.00257
## # ... with 284 more rows, and 24 more variables: COP_Spot <dbl>,
## # CZK_Spot <dbl>, DKK_Spot <dbl>, EUR_Spot <dbl>, GBP_Spot <dbl>,
## # HUF_Spot <dbl>, INR_Spot <dbl>, JPY_Spot <dbl>, KRW_Spot <dbl>,
## # MXN_Spot <dbl>, MYR_Spot <dbl>, NOK_Spot <dbl>, NZD_Spot <dbl>,
## # PEN_Spot <dbl>, PHP_Spot <dbl>, PLN_Spot <dbl>, RON_Spot <dbl>,
## # RUB_Spot <dbl>, SEK_Spot <dbl>, SGD_Spot <dbl>, THB_Spot <dbl>,
## # TRY_Spot <dbl>, TWD_Spot <dbl>, ZAR_Spot <dbl>
Tibbles are accessible in various ways. The most common base way is:
tibble[ rows, columns]
df <- df_ugly %>% tibble::tibble()

# To isolate the first row, fifth column:
df[1, 5]

# To isolate the entire third row:
df[3, ]

# To remove the second column:
df[ , -2]

# Select the first four rows of columns 1, 4 and 7:
df[1:4, c(1,4,7)]
tibble$Column
To isolate a specific column, you could use $
head(df$ARS_Spot, 10)
select
We could also use the select verb from the dplyr package (we will get to using this more later):
library(dplyr)
select(df, ends_with("_Spot"))
select(df, starts_with("ARS"))
select(df, contains(c("BRL", "CAD")))
# Using a vector:
selectors <- c("date", "ARS_Spot", "AUD_Spot", "BGN_Spot")
select(df, all_of(selectors))
# Using a function:
select(df, where(is.numeric))
You will use these verbs quite often when coding. Make a thick-underlined-and-yellow-highlighted note of the following (grepl and gsub!):
String <- c("SOME STRING", "Another STRING", "Last One", "SOME Other One")
# gsub: replacement tool
gsub(x = String,
pattern = " STRING",
replacement = "")
## [1] "SOME" "Another" "Last One" "SOME Other One"
# grepl: identifying a pattern match
# Let's trim String to only include entries with the word "SOME"
grepl(pattern = "SOME", String)
## [1] TRUE FALSE FALSE TRUE
# Notice that grepl produces TRUE / FALSE - this is called a logical vector in R.
# We can use this to subset our string as follows:
String[ grepl(pattern = "SOME", String) ]
## [1] "SOME STRING" "SOME Other One"
Notice above that String is a vector, not a dataframe. Thus TRUE selects an entry, while FALSE causes it to be dropped.
If we wanted to use grepl to subset columns, this can be done as follows (Notice my use of | to signify or):
df <- fmxdat::ugly_df %>% tibble::as_tibble()
# Let's now subset our dataframe by selecting columns of df that contain a "Z" or a "Y", as well as the date column
df[ , grepl("date|Z|Y", colnames(df))]
## # A tibble: 294 x 8
## date CNY_Spot CZK_Spot JPY_Spot MYR_Spot NZD_Spot TRY_Spot ZAR_Spot
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-01-04 -0.00437 0.00256 -0.0157 -0.00964 -0.0239 -0.0231 -0.0113
## 2 2012-01-11 0.00334 0.0179 0.00169 0.000797 -0.0117 -0.0120 -0.00385
## 3 2012-01-18 -0.000570 -0.0250 -0.000390 -0.00606 -0.00920 -0.0135 -0.0167
## 4 2012-01-25 0.00353 -0.0292 0.0125 -0.0131 -0.0152 -0.0131 -0.00851
## 5 2012-02-01 -0.00415 -0.00667 -0.0203 -0.0120 -0.0187 -0.0278 -0.0255
## 6 2012-02-08 -0.00211 -0.0221 0.0110 -0.0124 -0.00264 -0.00399 -0.0177
## 7 2012-02-15 0.000906 0.0320 0.0180 0.0107 0.00204 0.0112 0.0278
## 8 2012-02-22 -0.000635 -0.0163 0.0237 -0.00320 0.00434 -0.00515 -0.00334
## 9 2012-02-29 -0.000365 -0.0161 0.0107 -0.0103 -0.00575 -0.00603 -0.0297
## 10 2012-03-07 0.00257 0.0103 -0.000739 0.0117 0.0216 0.0187 0.0120
## # ... with 284 more rows
# Very important to note we can have or in grepl by using | between the entries
In contrast to vectors or columns in a tibble (where all elements in a column must be of the same type), we can use lists.
List structures can combine objects with different classes, which helps in organizing your data in a better workable format.
Lists can contain anything (dataframes, plots, vectors, etc), and extracting information from them can be done as follows:
Smith <- list(name="John", title = "President", firm="Google", salary = 500000)

print(Smith)

# To access buckets in the list, we can use $ or [[]]:
Smith$title
# Which is equivalent to:
Smith[[2]] # double [[k]] implies "entry at k"

# Lists can also store characters and values:
# e.g. calculating the salary in 1000s:
Smith$salary / 1e+3
# Note - length tells us how many elements are in the list:
length(Smith)
Note from the above:
[[ ]] unboxes a list;
[ ] unboxes a data.frame or a single list element.
Not everything can be done to lists - note which of the following work and which error:
l <- list(a=TRUE, b=c("a","b"), c=1:100)
l[[1]] # works: first element of a list
l[[2]][1] # works: first entry of second element in list.
l[[c(1,2)]] # error - cannot return two objects, without a list to hold them
l[[-c(1,2)]] # error - negative indices cannot be used for list elements
l[[c(TRUE,FALSE,TRUE)]] # error - logical indices cannot be used
l[[c("a")]] # works: select by name
A list can even store dataframes and plots.
Work through the following code to understand more about subsetting, formatting and wrangling base dataframes:
pacman::p_load(tibble)
df_TRI <- fmxdat::DailyTRIs
# Let's quickly remove the ' SJ' from all column names first.
# This requires us using colnames and gsub:
colnames(df_TRI) <- gsub(" SJ", "", colnames(df_TRI))
# Let's get each column's max value (we will explain apply next):
Max_vals <- apply(df_TRI[,-1], 2, max, na.rm=T)
Max_vals
## ABI ABL ACL AEG AGL AMS ANG AOD
## 9195.00 -Inf 17929.22 -Inf 30743.92 1798.53 5378.38 9300.00
## APN ARI ASR AVG BAW BGA BIL BTI
## -Inf -Inf -Inf 1180.00 1244.47 1117.64 11307.96 -Inf
## BVT CCO CFR CRH CRN DDT DRD DSY
## 7937.50 -Inf 3650.00 573.32 473.11 403.00 3628.02 3199.00
## ECO EXX FSR GFI GRT HAR IMP INL
## 7900.00 7500.00 575.00 1718.86 -Inf 3075.83 3130.97 3800.00
## INP IPL ITU JDG JNC KIO LBH LGL
## 1383.73 9750.00 8321.64 -Inf 358.78 -Inf -Inf 7791.02
## LHC LON MDC MND MNP MPC MSM MTN
## -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf
## MUR NED NPK NPN NTC OML PIK PPC
## -Inf 1345.00 1614.00 5933.33 -Inf 1870.00 -Inf -Inf
## REI REM RLO RMH SAB SAP SBK SHF
## -Inf 1725.00 -Inf -Inf 1244.00 1390.00 -Inf -Inf
## SHP SLM SOL TBS TKG TRU VNF VOD
## -Inf 1300.00 13055.00 7791.02 4805.53 -Inf 2600.00 -Inf
## WHL
## 1085.43
# Let's replace the is.infinite values with zero next (max of a column of NAs gives -infinity):
Max_vals[is.infinite(Max_vals)] <- 0
# Go through the above line of code to understand what is cooking here..
Min_vals <- apply(df_TRI[,-1], 2, min, na.rm=T)
Min_vals[is.infinite(Min_vals)] <- 0

Result <- list()
Result$Name <- "Result of Max Returns"
Result$Max <- Max_vals
Result$Min <- Min_vals
Result$LastDate <- tail(df_TRI, 1) # Notice my use of tail here... Converse is head.
Result
Notice that the object now created contains strings, vectors and a dataframe.
To find the highest max, use:
Result$Max[ which( Result$Max == max(Result$Max) ) ]
## AGL
## 30743.92
The above is a nice line of code to learn from:
$ is used to subset a list by name (as opposed to position, which would be e.g. Result[[2]])
which is great for subsetting: it returns the positions in a logical vector that are TRUE, e.g. which( x == max(x) )
== means equals. You could also use e.g. !=, >=, <=
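A tiny example of which at work (the vector and its names are made up for illustration):

```r
Max <- c(AGL = 30743.92, ABI = 9195.00, ACL = 17929.22)

which(Max > 10000)            # positions (with names) where the condition is TRUE
Max[ which(Max == max(Max)) ] # subset down to the maximum entry
```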
A very useful set of functions used in base R are the Apply functions.
Apply returns a vector or array (or list) of values obtained by applying a function to parts of an array or matrix.
# Let's create a function for creating normal data:
# Note: function(inputs){commands}
NormalDataGen <- function(n, mu, sd){ rnorm(n, mean = mu, sd = sd) }

a <- NormalDataGen(n = 100, mu = 5, sd = 2)
b <- NormalDataGen(n = 100, mu = 3, sd = 2.3)
c <- NormalDataGen(n = 100, mu = 2, sd = 1)
d <- NormalDataGen(n = 100, mu = 4, sd = 4)
e <- NormalDataGen(n = 100, mu = 3, sd = 2)

data <- data.frame(a,b,c,d,e)
class(data)
# Now, using apply note the following:
# apply(data, 1 is row and 2 is columns,function to be done)
# Using this, to calculate the sum of all the columns, simply use:
apply(data,2,sum)
# Another SUPER USEFUL addition to this is that we can add commands
# that are part of the function. E.g. had we calculated column means,
# we could use "na.rm = TRUE"" (remove NAs) and "trim" to trim for outliers.
# Adding these commands to mean should not be done in brackets, but after commas
apply(data, 2, mean, na.rm = TRUE, trim=0.35)
# Note that na.rm and trim belong to the mean function, not apply...
How would you know what "trim" does in the code above? USE STACKOVERFLOW. E.g., google the following:
what does trim do in apply r stackoverflow
You will likely find this useful answer.
Let’s do something a bit more advanced. In our df_TRI dataframe, let’s select only columns that have no NA.
We use which to trim our columns, !is.na() to identify which entries are not NA (the ! always means not in R), and all to see if all entries in a column are not NA.
df_TRI[, which( apply( !is.na(df_TRI), 2, all) ) ]
## # A tibble: 501 x 30
## Date ACL AGL AMS ANG BAW BGA BIL BVT CFR EXX
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2003-01-02 13078 26924. 756. 2966. 838. 592. 10140. 6794. 2300 2350
## 2 2003-01-03 12975. 27523. 752. 2980. 838. 597. 10345. 6825 2320 2400
## 3 2003-01-06 12718. 28764. 737. 2985. 834. 593. 10879. 7062. 2305 2400
## 4 2003-01-07 12913. 27992. 752. 2990. 833. 605. 10423. 7188. 2310 2421
## 5 2003-01-08 12973. 28085. 742. 2961. 848. 604. 10345. 7312. 2272 2515
## 6 2003-01-09 12924. 28086. 747. 2978. 853. 590. 10354. 7488. 2260 2600
## 7 2003-01-10 12666. 28091. 747. 2998. 853. 597. 10521. 7075 2260 2530
## 8 2003-01-13 13335. 28563. 742. 2857. 861. 605. 10573. 7275 2259 2529
## 9 2003-01-14 13403. 29209. 743 2857. 864. 620. 10778. 7250 2260 2505
## 10 2003-01-15 13552. 28132 742. 2862. 856. 620. 10192. 7500 2280 2549
## # ... with 491 more rows, and 19 more variables: FSR <dbl>, GFI <dbl>,
## # HAR <dbl>, IMP <dbl>, INL <dbl>, INP <dbl>, IPL <dbl>, ITU <dbl>,
## # LGL <dbl>, NED <dbl>, NPK <dbl>, OML <dbl>, REM <dbl>, SAB <dbl>,
## # SAP <dbl>, SLM <dbl>, SOL <dbl>, TBS <dbl>, VNF <dbl>
That is a powerful line of code - make sure you understand it as it encompasses all we’ve done to now.
Note: There is a whole family of apply, sapply, mapply, etc. that we won’t go into. But check it out, it’s a powerful suite of commands…
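As a small taster of that family (a sketch only, since these notes don't cover it further): sapply applies a function to each element and simplifies the result to a vector where possible, while lapply always returns a list.

```r
x <- 1:5

sapply(x, function(i) i^2) # simplified to a numeric vector: 1 4 9 16 25
lapply(x, function(i) i^2) # the same result, but returned as a list

# sapply is convenient for quick summaries over a list of vectors:
sapply(list(a = 1:10, b = 21:30), mean)
```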
The logical operators and and or can be useful in coding. They are written as & and | respectively.
x <- tibble(RandomData = rnorm(100,50,2))
# Let's only keep data that lie between 48 and 50 (for whatever reason...)
z <- x$RandomData[ which( x$RandomData < 50 & x$RandomData > 48) ]

#==== Note:
# x$RandomData is the column
# using square brackets, like : x$RandomData[1:5], gives me e.g. the first five positions
# "which" gives me the positions that satisfy it being <50 & > 48... so:
# x[ which( x$RandomData <50 & x$RandomData >48) ] thus does the job!
Let’s look at the workhorse of programming: Loops.
Loops allow you to tell R to repeat something over and over again. See this example:
x <- rnorm(1, mean = 100, sd = 10)
# If statement:
if (x < 100){
  m <- "Heads!"
} else {
  m <- "Tails!"
}
print(m)
## [1] "Heads!"
# Note that if you repeat the above statement - it is the same as tossing coins...
Now such statements of logic often appear in loops to guide R to do what you want it to do. E.g.:
sqr <- c() # Create an open vector to be filled:
x <- seq(from = 1, to = 100, by = 10)

# Loop to square every entry of x:
for(i in 1:length(x)){
  sqr[i] <- x[i]^2
}
print(sqr)
while can be used to repeat a routine until an objective is reached.
E.g.: we create a dataframe x, and multiply its entries by more random data, until the maximum of the original column exceeds 3.
x <- tibble(Random = rnorm(100))

while( max(x$Random) < 3){
  x <- bind_cols(x, Another_Random = rnorm(100, 1, 0.1))
  x$Random <- x$Random * x$Another_Random
  x <- x[,-2]
  print(max(x$Random))
}
while loops are invaluable in certain applications - e.g. I recently built a function that proportionally reallocates weights from stocks with weights > 10%, until there is no stock above 10% (after each reallocation it is possible that another stock now exceeds the 10% threshold, requiring further constraining and reallocation).
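A stripped-down sketch of such a capping routine (purely illustrative: the weights, the 25% cap and the variable names are my own, not the actual function):

```r
weights <- c(0.4, 0.3, 0.2, 0.1) # portfolio weights, summing to 1
cap <- 0.25

while( max(weights) > cap ){
  excess <- sum( weights[weights > cap] - cap ) # total weight above the cap
  weights[weights > cap] <- cap                 # cap the offenders
  below <- weights < cap
  # reallocate the excess proportionally to the uncapped stocks:
  weights[below] <- weights[below] + excess * weights[below] / sum(weights[below])
}

weights # no entry exceeds the cap, and the weights still sum to 1
```

The reallocation itself can push another stock over the cap, which is exactly why a while loop (rather than a single pass) is needed.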
Side note: paste is powerful, as it pastes together text. Compare paste(…, sep = "/") and paste0(…). Replace … with text strings, e.g.:
t <- "text1"
t2 <- "text2"
paste(t, t2, sep = "/")
paste0(t, t2) # No need for sep, implied that sep = ""
Even better than paste is to use glue.
Install the package first (and also another one we’ll use later: lubridate):
pacman::p_load(lubridate, glue)
It is now even easier to paste, see this example:
date <- lubridate::today()
randomValue <- rnorm(1)
Mood <- ifelse(randomValue > 0, "happy", "sad")

print(
  glue::glue("
\n=======\n
This is an example of pasting strings in text.
On this day, {date}, my mood is {Mood}, as my value is {ifelse(randomValue>0, 'positive', 'negative')}.
That's pretty cool right? ...note here will now follow\n\n a few spaces...
\n=======\n
")
)
##
## =======
##
## This is an example of pasting strings in text.
## On this day, 2022-02-07, my mood is sad, as my value is negative.
## That's pretty cool right? ...note here will now follow
##
## a few spaces...
##
## =======
Notice my use of ifelse above. It follows the same notation as in Excel. We will use if, or, while, etc. a lot in our coding.
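One difference from Excel worth noting: ifelse is vectorized, so it evaluates its condition element by element (handy for building columns), while if only looks at a single TRUE/FALSE:

```r
x <- c(-2, 0.5, 3, -1)

ifelse(x > 0, "Positive", "Negative")
# [1] "Negative" "Positive" "Positive" "Negative"

# if(), by contrast, is for control flow on a single condition:
if (x[1] > 0) print("Positive") else print("Negative")
```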
rmsfuns is a wrapper package that offers several convenient wrappers and commands that we will often be using in this course. See the details of this package here and the source code here.
Install the rmsfuns package as follows:
if (!require("rmsfuns")) install.packages("rmsfuns")
library(tidyverse)
library(rmsfuns)
dates1 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "alldays")
dates2 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdays")
dates3 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "calendarEOM") # Calendar end of month
dates4 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOW") # weekday end of week
dates5 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOM") # weekday end of month
dates6 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOQ") # weekday end of quarter
dates7 <- dateconverter(as.Date("2000-01-01"), as.Date("2017-01-01"), "weekdayEOY") # weekday end of year
Let’s create a random data frame, with a date column and random returns:
datescol <- dateconverter(as.Date("2015-01-01"), as.Date("2017-05-01"), "weekdays")

# This creates a vector of random data with a mean of 0.01 and sd of 0.03:
RandomReturns <- rnorm(n = length(datescol), mean = 0.01, sd = 0.03)

df <- data.frame( Date = datescol, Return = RandomReturns)
# Let's now quickly view this in excel..
ViewXL(df)
Notice that the random file location is intentionally printed when the function is run - so that you can delete the file after viewing it (so as not to clutter your pc).
Sometimes when running long calculations, you may find that you want to time your calculations.
While there are many ways to skin this cat, it might be easiest to make your prompt (what you see at the bottom of your RStudio screen) show the current time. You can then track through your estimation how long each step has taken (explicitly saving time stamps could be done using Sys.time()):
PromptAsTime(TRUE)
x <- 100
Sys.sleep(3)
x*x
print(x)
PromptAsTime(FALSE)
Now you can see, after the fact, that x*x took 3 seconds (not really - I purposefully made it lag, but you get the point!).
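Explicitly saving time stamps with Sys.time(), as mentioned above, could look like this (toy example):

```r
# Record the start time, do some work, then take the difference:
start <- Sys.time()
Sys.sleep(1)                 # stand-in for a long calculation
elapsed <- Sys.time() - start
print(elapsed)               # a difftime object, roughly 1 second here
```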
Building folder path structures in R can be challenging, but build_path makes this easy. This could, e.g., be used in a function that creates folders and populates them with figures / data points. E.g., suppose you want to create a folder structure as follows:
To create the folders to house the plots (and save their locations in a vector to use it later), run:
# Specify a root on your computer, e.g.:
Root <- "C:/Finmetrics/Practical2/Folder_Example"
# Specify the sectors:
Sectors <- c("Financials", "Industrials", "HealthCare")
# Construct the structure and bind them together:
# base R's paste0:
Locs <- build_path( paste0(Root, "/", Sectors, "/Figures/") )
# glue's glue... (easier to use - see later my summary of this call):
Locs <- build_path( glue::glue("{Root}/{Sectors}/Figures/") )
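For context, here is a base-R sketch of roughly what build_path does (using a temporary directory instead of C:/ so it runs anywhere):

```r
# Build the nested folder paths and create them recursively:
Root <- file.path(tempdir(), "Folder_Example")
Sectors <- c("Financials", "Industrials", "HealthCare")
Locs <- file.path(Root, Sectors, "Figures")
invisible(lapply(Locs, dir.create, recursive = TRUE, showWarnings = FALSE))
dir.exists(Locs)
## [1] TRUE TRUE TRUE
```

build_path does essentially this, while also returning the vector of created locations for later use.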
It is important to first, before anything else, save your work in a directory where you will be able to locate it at a later stage. I strongly suggest, though, always working from an .Rproj file.
In fact, you should always work from an .Rproj file. Always.
Step 1: Create a Folder within which to store your work.
Create an .Rproj file for the folder (from RStudio: File / New Project). ALWAYS work within this project environment - it makes managing your work MUCH easier.
RProjects are awesome because:
+ It sets the working directory to the folder
+ It keeps your working changes, which is useful if, e.g., you forgot to save
+ It allows you to easily navigate through your child folders.
This should be standard for when starting any project.
I show you below first how to do this by hand, and second how to use fmxdat's automatic way (much quicker for setting up a good folder environment. Please check both):
You can do this as follows.
1.) Create a directory on your PC where you want the project to reside (spend time thinking about intuitive directory structures).
2.) Highlight the directory address and copy it (Ctrl + C)
3.) From within any R studio instance, simply run:
fmxdat::make_project(Open = T)
This gives you a structure resembling the following (and if you set Open = T, it will open this folder as well):
On the bottom right panel - open the README.Rmd file.
This README file will be your diary as your project develops.
When opening your README.Rmd file after opening your newly created .Rproj file, your screen in Rstudio will look as follows:
Your README.Rmd file is split into a text part (for a title use a hash, for a subtitle a double hash, etc.) and a code part (the shaded chunks that open with ```{r} and close with ```).
Type what you are doing (purpose and steps), while placing your code to be executed in the code chunks (created by pressing Ctrl + Alt + i)
The code chunks then separate code from text.
You can then create htmls of your README (by pressing Ctrl + Shift + k, try it: how cool is that? We can discuss knitting htmls and pdfs at a later stage) - which is great for documentation purposes.
As an example, see the Texevier package’s README here which we will use later (it is a public package so you can view the repository directly): https://github.com/Nicktz/Texevier
Notice that GitHub automatically transforms the README.Rmd file into an html landing page below the code - imagine if all your projects looked like this; think how that would increase your productivity!
After you’ve worked on your project, your README.Rmd diary becomes your manual.
Next time you open your project - be sure to open it by double clicking on the .RProj file
There are very many ways to get data into R (openxlsx, readr, readxl, etc.).
Which you use really mostly depends on what data source you are loading (and size). There are also several new platforms being developed for loading and saving data. Below follows a summary of how I currently prefer to load data (although this will likely change as a more stable version of feather is launched).
If you are unsure of how to load data, you might find this blog useful. It provides you with the code to load a dataset saved in a range of formats.
I prefer storing raw data that should be human readable (e.g. settings files) in csv format, and then using the readr package for loading data. For data you want to access later from R (i.e. saving R dataframes) I prefer using rds file storage (as one can set the compression easily - useful for BIG datasets).
For this, the readr package can be used (loaded as part of library(tidyverse) ).
Do the following to understand the process on your side:
pacman::p_load(tidyverse)
dta <- fmxdat::BRICSTRI
# Create a folder called Data in your root.
# If you followed my gif's steps above - you are currently in a .Rproj environment,
# meaning you can straight-up use:
dir.create("Data/")
# And now let's store our data file in here:
write_rds(x = dta, path = "Data/Example_File.rds")
# And now to load it:
df <- read_rds(path = "Data/Example_File.rds")
# Note: in readr >= 1.4, the path argument has been renamed to file.
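The readr calls above wrap base R's rds machinery; here is a base-R sketch of the same compressed storage (the file name and toy data are my own):

```r
# Save a data frame as a compressed rds file and read it back unchanged:
df <- data.frame(Date = as.Date("2022-02-07") + 0:4, Return = rnorm(5))
f <- tempfile(fileext = ".rds")
saveRDS(df, f, compress = "xz")  # "gzip", "bzip2" or "xz"; useful for BIG datasets
identical(readRDS(f), df)
## [1] TRUE
```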
In addition to easily and more intuitively loading your data and better predicting the column types, readr allows the user to specify the types of columns you are loading (and even set defaults).
E.g., if I know my data’s first column is a date column, then most of my columns are character columns, except for a few exceptions, I can specify the loading to be as follows (see here for a Stackoverflow Q and A on this):
data <- read_csv("/Data/Indexes.csv", col_types = cols(.default = "d", Date = "D"))
# Above, d stands for dbl (used for numerical columns)
# If you want to specify that a column contains characters, use "c"
For those interested, to see what other column types you get view this vignette.
Returning back to our loaded data, we can view it as follows:
View(data)
head(data, 10) # Top 10
tail(data, 10) # Bottom 10
colSums(is.na(data))
# na.rm = TRUE tells R to count all the non-NA items in each column:
colSums(data == 0, na.rm = TRUE)
colSums(data >= 500, na.rm = TRUE)
# Remove all rows containing NAs:
dataNoNA <- na.omit(data)
# Or, alternatively, replace all NAs with zero:
data[is.na(data)] <- 0
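To make the counting logic concrete, here is a toy illustration (made-up data) of the NA and threshold counts used above:

```r
# Two columns with a mix of NAs, zeros and a large value:
toy <- data.frame(a = c(1, NA, 0), b = c(NA, NA, 500))
colSums(is.na(toy))               # NAs per column: a = 1, b = 2
colSums(toy == 0, na.rm = TRUE)   # zeros per column: a = 1, b = 0
colSums(toy >= 500, na.rm = TRUE) # values >= 500 per column: a = 0, b = 1
```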
Dates in R can be converted to a formal date column using several commands.
A simple way of doing this is to use the as.Date() command. See this page for more details and examples. To create a date column, we could e.g. use:
data.frame(date = seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
Suppose you imported an excel file with data, and R failed to read your date column as a valid date column, you could tell it that it is a date column using:
df$Date <- as.Date(df$Date, format="%d/%m/%Y")
Where Date is your date column, and the format is day, month and year, separated by a forward slash. Change the above to suit your example if you run into problems.
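For example, with hypothetical character dates in this day/month/year format:

```r
# Character dates as Excel might hand them to you:
Date_chr <- c("07/02/2022", "15/03/2022")
as.Date(Date_chr, format = "%d/%m/%Y")
## [1] "2022-02-07" "2022-03-15"
```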
We will have more date setting examples later in the course.
Suppose you imported a Total Return Index series, with daily dates, but you wanted to only focus on weekdays, or Mondays, or even specific months (e.g. to study the January trading effect). This could be done as follows.
Let’s first create some fake data for illustrative purposes:
pacman::p_load(dplyr)
df <-
  tibble(
    date = seq(as.Date("2012-01-01"), as.Date("2015-08-18"), "day"),
    TRI = rnorm(1326, 0.05, 0.04))
# Let's now create a function that only focuses on weekdays:
dow <- function(x) format(as.Date(x), "%a")
df$day <- dow(df$date)
df[!grepl("Sat|Sun", df$day), ]

# To focus only on particular months, let's say January and February:
dom <- function(x) format(as.Date(x), "%b")
df$Month <- dom(df$date)
df[grepl("Jan|Feb", df$Month), ]
# Note: Use the above only after loading package dplyr,
# and setting your data.frame in tibble::as_tibble format.
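One caveat: the "%a" and "%b" labels depend on your locale. A locale-independent sketch (my own toy data) uses POSIXlt's numeric weekday instead:

```r
# POSIXlt's wday slot codes weekdays numerically: 0 = Sunday ... 6 = Saturday.
df2 <- data.frame(date = seq(as.Date("2012-01-01"), as.Date("2012-01-14"), "day"))
df2$wd <- as.POSIXlt(df2$date)$wday
df2[df2$wd %in% 1:5, ]   # weekdays only: 10 of the 14 days remain
```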
Functions are the absolute building block of coding in R. It is always recommended to break your workflow down into separate functions, each specializing in one thing, creating a suite of functions that makes up your analysis.
Think of functions in R as the cells that make up your body: small, squishy and important for life.
Below is a simple function to illustrate the notation. In the next session we will elaborate much more:
x = rnorm(1000)
# Let's now create a function that returns the maximum value of a vector:
Max_Value <- function(vector) {

  Maximum <- max(vector)

  return(Maximum)
}
Max_Value(vector = x)
## [1] 3.390371
# Notice that Maximum was not saved anywhere as the function's environment was temporary:
exists("Maximum")
## [1] FALSE
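Functions can also take default arguments, which we will use often. A toy sketch (all names illustrative):

```r
# factor defaults to 100 unless the caller overrides it:
Scale_Returns <- function(x, factor = 100) {
  x * factor
}
Scale_Returns(0.05)              # uses the default factor of 100
## [1] 5
Scale_Returns(0.05, factor = 10) # overrides the default
## [1] 0.5
```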
We will discuss in our next session how to structure your workflow and manage your projects - motivating the use of functional programming, and coding in a structured & replicable way.
Next time, we will be discussing functions and project workflow in R.
This will be a bit of a philosophical session, but it is important for understanding how best to structure and plan your projects (and life) in R.