The aim of this tutorial is to introduce you to functional coding, and to explain how to code in a structured way.
Let’s now create a .Rproj file for this session to save all your notes (always start projects this way).
Use fmxdat::make_project to create the default folder structure that you should use.
Do this by creating any folder (e.g. create C:/Temp_Go/someproject), copy the address, and then simply run the following code from any R session’s console:
fmxdat::make_project(Open = T)
This should create a folder structure looking like:
Two broad schools of coding thought have emerged over time: object-oriented and functional.
While the progression in coding paradigms has not been linear and clean, most would agree that these two are the main structures.
Since the development of Fortran in 1957 (Formula Translator, an unstructured and intentionally mathematical language), other paradigms have developed - specifically seeing the advent of variable assignments, conditionals and loops.
COBOL in 1959 was an attempt to make programming languages more like natural language - easier to learn and more accessible, but more limiting as well. Since then, various other languages and paradigms have emerged - all with associated benefits and costs.
As systems became more complex and interlinked, a need for structure in code emerged. Edsger Dijkstra is credited with highlighting the need for, and developing, the first structured coding paradigm in 1969.
This was arguably the beginning of functional, or structured, programming, whereby the coder is forced to assemble arguments, building structure and allowing the ability to test their validity.
The main aim is to move from speed of coding and result, to process consideration and stability.
OO ideas trace back to Simula in the 1960s (developed by Ole-Johan Dahl and Kristen Nygaard), with Alan Kay later coining the term ‘object-oriented’. The main idea is polymorphism (e.g. a man can be a father, a son and a husband at the same time - i.e. multiple roles depending on context).
Functional coding paradigms seek to constrain tasks - writing code that does one thing (e.g. Return_Calculation) and produces no side effects.
This means breaking your workflow down into smaller parts - allowing the ability to test and check functions, while also avoiding having functions interact with environment variables.
The following is an ideal example of what we are trying to achieve:
x <- 10
# Specify function with input variable Y:
Vegas <- function(Y){
  x <- Y ^ 2
  x
}
# Execute function, setting output equal to new variable `result`
result <- Vegas(Y = 100)
# Notice now, that although x was assigned in the function... it stayed only assigned in the function.
# So x remains 10, while result is equal to 10 000
print(x)
## [1] 10
print(result)
## [1] 10000
Notice too from this example that Y is not in your environment. Basically, what happens in Vegas stays in Vegas.
In summary, functions are used to take inputs and transform them to an output. Crucially, inputs are not changed in any way, and functions produce no side effects. That is a virtue in itself.
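The point that inputs are never changed follows from R's copy-on-modify semantics: a function works on a copy of its argument, so the caller's object survives untouched. A small sketch (Scale_Returns is an illustrative name):

```r
# R functions work on copies: modifying an argument inside a
# function never changes the caller's object (copy-on-modify).
Scale_Returns <- function(x) {
  x <- x * 100   # modifies only the local copy
  x
}

v <- c(0.01, 0.02)
scaled <- Scale_Returns(v)

print(v)       # still 0.01 0.02 - the input is untouched
print(scaled)
```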
Java, C++, Python and more can be bucketed in the OO paradigm of polymorphism.
Effectively, it is a messaging paradigm, where context of a command matters.
So if I run plot(X) and plot(Y), the output will be determined by the type and nature of X and Y (same function, but the input object determines the output). This is polymorphism: whether I am now a husband, dad or son depends on the context.
Functional Programming seeks to be indifferent to context and structured - plot_histogram seeks to produce a histogram, always.
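To see this dispatch mechanic in R itself, here is a minimal sketch using S3 classes: one generic call, different behaviour per input class (describe is an illustrative name, not a base R function):

```r
# A minimal sketch of OO-style dispatch in R (S3 classes):
# the same generic call behaves differently depending on the class of its input.
describe <- function(x) UseMethod("describe")
describe.numeric   <- function(x) "I summarise numbers"
describe.character <- function(x) "I summarise text"

describe(10)       # dispatches to describe.numeric
describe("hello")  # dispatches to describe.character
```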
In summary, we use a fluid combination of both OO and functional. I certainly use both in any project. This session describes both, and there is no right or wrong at all times. Horses for courses - but please try to prize stability and structure in everything you do, which is core to functional coding.
As mentioned, when running a function - it does so in a completely new environment.
This way, if you execute a function multiple times, the exact same outcome is produced:
foo <- function(x){
  x <- x^3
  x
}
foo(x = 10)
## [1] 1000
foo(x = 10)
## [1] 1000
foo(x = 10)
## [1] 1000
Below we illustrate the use of functions in R with examples - followed by explanations for what is happening.
x = rnorm(1000)
# Let's now create a function to replace all positive values with the word positive:
Cap_Randoms <- function(vector, Threshold = 0, Replace_Name) {
  # Just for illustration:
  z <- 1000
  # What function is doing:
  vector[vector >= Threshold] <- Replace_Name
  vector
}
head( Cap_Randoms(vector = x, Replace_Name = "Positive"), 10)
## [1] "-1.32513427435075" "Positive" "Positive"
## [4] "Positive" "Positive" "Positive"
## [7] "-0.313836352808351" "Positive" "Positive"
## [10] "Positive"
Notice from the above:
The function has been named Cap_Randoms, with 3 possible inputs given in the brackets.
Notice that Threshold was given a default value of zero, which can be overwritten when calling the function.
Replace_Name has no default, and must be provided.
The only thing coming out of the function is vector (last thing function does)
This can be explicitly set using return(vector), but it could also simply be the last element as above.
Check that z has not been saved in your work environment - running print(z) after the above script will produce an error.
This last point is vital to understanding the power of R functions - the function's environment is sanitized.
It only deals with what comes in - and only produces what you want it to produce.
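The implicit-versus-explicit return point above can be sketched in two lines (illustrative function names): both versions return the same value, one via return(), the other by ending on the value itself.

```r
# Two equivalent ways to return a value from a function:
add_one_explicit <- function(x) {
  return(x + 1)   # explicit exit
}
add_one_implicit <- function(x) {
  x + 1           # the last evaluated expression is returned
}

add_one_explicit(9)  # 10
add_one_implicit(9)  # 10
```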
Let’s play with the above function some more:
# Let's overwrite the default to set Threshold to 2, and name it "TWO or above"
Result <- Cap_Randoms(vector = x, Threshold = 2, Replace_Name = "TWO or above")
# Notice that z, which is created internal to the function, is not created in our environment:
exists("z")
## [1] FALSE
# We can also add messages in functions, e.g.:
pacman::p_load(glue)
Cap_Randoms_Msg <- function(vector, Threshold = 0, Replace_Name) {
  # Just for illustration:
  Count <- sum(vector > 0)
  SUM <- length(vector)
  # What function is doing:
  vector[vector >= Threshold] <- Replace_Name
  message( glue::glue("Percentage of positive elements = {Count/SUM * 100}%"))
  vector
}
Result <- Cap_Randoms_Msg(vector = x, Replace_Name = "Positive")
In the above example, notice that a message was printed. This message was not saved in the environment, and neither were Count or SUM. The only element passed from function -> environment was vector.
This is to be preferred.
Notice from the last schematic, the output of the generic function plot is guided by the input.
So - dependent on the input, the output of plot is different.
This means that functions can be generic (e.g. plot), and inputs are specific in an OO workflow (e.g. Result and Result2).
This means your objects have multiple facets - it is not just e.g. a dataframe, but has other elements too (e.g. guiding plot to produce a lineplot).
Functional, however, forces your function to be specific: it requires a particular input in a particular form (ideally - there are exceptions of course)
This means you’d have two functions: plot_line and plot_bar, where the function guides the output (not the input).
Think about the above and reason for yourself which process you are most comfortable with.
In this course - I prefer the workflow of having many functions - each having a specific role - as opposed to one long script that does 1000 things.
Although functions in R are flexible and can be used as inputs, have flexible parameter inputs, etc. - one thing you should strive toward (nearly at all times) is that your functions are pure.
Purity in functions implies:
The output only depends on the inputs (no environment variable dependencies. Ever.)
The function has no side-effects (e.g. changing a global variable value). What comes out of a function should be explicit.
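Both points can be made concrete with a short contrast (impure_add and pure_add are illustrative names): the impure version leans on and mutates a global, so its answer drifts; the pure version depends only on its inputs.

```r
# Contrasting an impure and a pure function.
counter <- 0

impure_add <- function(x) {
  counter <<- counter + 1   # side effect: mutates a global variable
  x + counter               # output depends on hidden state
}

pure_add <- function(x, increment) {
  x + increment             # output depends only on the inputs
}

impure_add(10)   # answer drifts with every call...
impure_add(10)
pure_add(10, 1)  # ...while this is 11, every single time
pure_add(10, 1)
```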
You could, in principle, structure an entire project around functions that each do one thing.
E.g.:
Universes <- list( c("AUS", "NZ", "SA", "UK", "US" ) )
# Load
df <- load_foo( Load_List = Universes)
# Wrangle
df_wrangled <- wrangle_foo(df)
# Plot
Plot <- plot_foo(df_wrangled)
# Print plot
plot(Plot)
Object-oriented coding, in contrast, creates class objects to which you could, e.g., apply various calls.
Many older R packages still make extensive use of class objects.
An example of this principle would be:
X <- do_all_Foo( Load_List = Universes )
# To produce a histogram plot, use 'plot' on the object X:
plot( X )
Y <- do_something_else_Foo( Load_List = Universes )
# To now produce a lineplot for Y, use again 'plot' on Y:
plot( Y )
Notice that X and Y are plotted simply by calling ‘plot,’ which then produces a histogram plot for X, and a lineplot for Y.
The output of ‘plot’ is determined by the input class - the first, say, produces an output with class ‘My_Foo’, with a structure that allows a histogram to be produced by plot (and, alternatively, a lineplot for Y based on its object class).
The question is: should we be explicit about the plot created (i.e. have an explicit function called histogram_plot and line_plot producing each) - or should the output of ‘plot’ be governed by the output object?
Your answer to this question should guide you in terms of which paradigm you are most comfortable coding in. Although it might seem tedious to have functions for all your operations, it ultimately leads to cleaner, more clearly defined and easier-to-debug code. The choice is yours, but at least be consistent in how you code.
Functions in R are like meat factories.
You specify a name (Joe’s meats), explicitly state what goes in and then explicitly state what goes out.
Simple. That’s it.
Your meat factory should:
Have all internal elements listed as parameters
Check this, otherwise you could easily be disappointed with the results of your efforts, as this guy learned (he did not ensure that all internal objects in his functions were explicitly defined):
Vague name, an external dependency not defined as a parameter (y), and also not explicit about what exits the function:
foo <- function(x){
  z <- x * y
}
y <- 100
Result <- foo(x = 10)
Result
## [1] 1000
No external dependencies, explicit about what exits (although a great function would’ve had a more informative name obvs):
foo <- function(x, y){
  z <- x * y
  return(z)
}
Result <- foo(x = 10, y = 100)
Quick note on return use:
It stops a function and returns what you specify - so any code that follows is never run:
foo <- function(x, y){
  z <- x * y
  return(z)
  g <- z * 2000   # never reached
  return(g)
}
Result <- foo(x = 10, y = 100)
print(Result)
## [1] 1000
In the above case you might as well not use return at all, and simply have the function end with the value you wish it to produce.
Use return and stop to break out of functions early and return what you want.
Let’s show this below. I’m also going to throw in (for free) an illustration of how to handle possible breaks without R screaming and stopping in a tantrum: purrr::safely.
safe_function <- purrr::safely(any_function) creates a wrapped function; calling it returns a list with two elements, result and error - one of which will be NULL. See the example below of how to use it:
pacman::p_load(purrr)

foo <- function(x, y){
  z <- x * y
  if(z > 999) stop( glue::glue("\n\nOh NO!\nvalue of z exceeds 999. The value was {z}. \nStop and try again please... Cheers, the developer.") )
  z
}
safe_foo <- purrr::safely(foo)

# Here it will fail:
Result <- safe_foo(x = 10, y = 100)
print(Result$result)
print(Result$error)

# Here it succeeds:
Result <- safe_foo(x = 1, y = 100)
print(Result$result)
print(Result$error)

# So now save the result, and you are off to the races:
Result <- Result$result
Use safe wrappers for your functions where you envisage possible breaks - but where a warning or message will suffice. This way, you can carry on with a loop even if an error occurs.
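Under the hood, a safe wrapper like purrr::safely is roughly a tryCatch that turns an error into data. A base-R sketch of the idea (safely_base and risky_log are our own illustrative names, not library functions):

```r
# A base-R sketch of what a safe wrapper does: tryCatch turns an
# error into a list element, so a loop over it can keep running.
safely_base <- function(f) {
  function(...) {
    tryCatch(list(result = f(...), error = NULL),
             error = function(e) list(result = NULL, error = e))
  }
}

risky_log <- function(x) {
  if (x <= 0) stop("x must be positive")
  log(x)
}
safe_log <- safely_base(risky_log)

res_ok  <- safe_log(exp(1))  # $result is 1, $error is NULL
res_bad <- safe_log(-5)      # $result is NULL, $error holds the condition
```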
Let’s create a few more examples to ground the functional intuition.
These functions might require you to spend time understanding what is going on - I guarantee it will aid your understanding if you spend time with this.
Below is a simple example of a function that you can use for calculating the standardized return values of the BRICSTRI file.
We will use it with apply to apply the function to all columns:
df_tri <- fmxdat::BRICSTRI

my_std <- function(Column){
  Column = Column / lag(Column) - 1
  mean( Column, na.rm = T) / sd( Column, na.rm = T )
}
# Now we can use this function directly in apply:
apply(df_tri[,-1], 2, my_std)
## brz chn ind rus zar
## NaN NaN NaN NaN NaN
# Note we had to ignore the date column using [, -1].
# (Also note: base R's stats::lag does not shift a plain vector, so the computed
# returns are all zero and mean/sd gives NaN - dplyr::lag would shift properly.)
# This could also be explicitly built into the function to ignore columns that are not numeric.
# We will use here the more generic sapply to identify classes.
# Note what it does:
sapply(df_tri, class)
## Date brz chn ind rus zar
## "Date" "numeric" "numeric" "numeric" "numeric" "numeric"
# The base function class describes what an object is. E.g. class(10) and class("Text")
# Let's build this check into our function, while also rounding results to three decimals:
my_safer_std <- function(Column){

  if( all( class(Column) %in% "numeric") ) {
    Column = Column / lag(Column) - 1
    Result <- mean( Column, na.rm = T) / sd( Column, na.rm = T )
    Result <- round( as.numeric(Result), 3)
    Result
  } else {
    Result <- "Not numeric column soz."
    Result
  }

}
lapply(df_tri, my_safer_std) # Gives list result
## $Date
## [1] "Not numeric column soz."
##
## $brz
## [1] NaN
##
## $chn
## [1] NaN
##
## $ind
## [1] NaN
##
## $rus
## [1] NaN
##
## $zar
## [1] NaN
sapply(df_tri, my_safer_std) # Gives unlisted result (vector)
## Date brz chn
## "Not numeric column soz." "NaN" "NaN"
## ind rus zar
## "NaN" "NaN" "NaN"
Notice from the above that the result of sapply (the vector version of lapply) is a single vector containing both character and numeric values - and since a vector cannot hold both types, R coerces everything to character.
This would be akin to trying:
mixed_vector <- c( 10, 12, "Hello", 18)
# Note by adding "Hello", everything is made character.
You can nest functions (i.e. have functions in functions) - but be careful with this. The following works, e.g., because each level defines its own f, and evaluation proceeds from the inside out:
f <- function(x) {
  f <- function(x) {
    f <- function(x) {
      x ^ 2 # first
    }
    f(x) + 1 # second
  }
  f(x) * 2 # third
}
f(10)
## [1] 202
The above works, but is extremely poor coding…
You should always strive to explicitly define and label your functions appropriately so as to avoid confusion.
Functions can of course also be applied in loops (note that loops are old-school - you could often rather use one of the apply functions…).
Let’s e.g. loop through a list and replace all values above 1 with 1000:
# Create open list:
df_List <- list()
df_List$Name <- "Some name"
df_List$x <- rnorm(1000)
df_List$y <- rnorm(1000)
df_List$z <- rnorm(1000)

capper <- function(dfl, cap_thresh = 1, replacer = 1000){

  if( class( dfl ) == "character" ) return(dfl)
  dfl[ which( dfl > cap_thresh) ] <- replacer
  dfl

}

# Let's now create a loop for all the list entries:
for(i in 1:length(df_List)){
  df_List[[i]] <- capper(df_List[[i]])
}

# Incidentally, we could use lapply below - which applies the function, capper, to each drawer in my list cupboard...
df_List <- lapply(df_List, capper)
Notice above the loop adjusts the df_List object directly - a loop is not a functional environment!
This follows as loops merely repeat a command, whereas a function has a defined input and output.
Ideally, repeating a function many times should not affect the outcome, i.e. Result <- foo(df = X) should always have the same answer. Even if you repeat it a hundred times.
Simply put, if you ask me if I like to braai a 1000 times, my answer should be yes 1000 times.
Ideally you should build your functions & ensure they are sanitized (knowing exactly what goes in and out, with some nice descriptions too), after which they are saved with an informative name.
They can then be sourced at anytime (even in functions) and used in various places in your project.
Let’s take a safe and very generic return calculator function, which first identifies columns that are numeric - and then calculates the returns for the column.
pacman::p_load(dplyr)

# First note, using sapply we can apply a function to all the columns of a dataframe:
sapply(df_tri, class)
##      Date       brz       chn       ind       rus       zar 
##    "Date" "numeric" "numeric" "numeric" "numeric" "numeric"
# We will now use this in our function to find the columns that are numeric:
Return_Calculator <- function(df){

  # Let's create a vector of names of numeric columns:
  Return_Columns <- names( which( sapply(df, class) == "numeric") )

  if( length(Return_Columns) == 0 ) stop("No numeric columns! Try another dataframe....")
  # names is the vector equivalent of colnames used before...

  # Now we can create a function, in a function, and immediately use it:
  Return_Creator <- function(column){
    column / lag(column) - 1
  }

  # Right, let's use this function now:
  Returns_df <- apply( df[, which(colnames(df) %in% Return_Columns)], 2, Return_Creator)

  # Let's now append this back to the other dataframe information:
  df_done <-
    bind_cols(df[, which(!colnames(df) %in% Return_Columns)],
              as_tibble(Returns_df))

  # Ensure similar ordering for completeness:
  df_done[, colnames(df)]

}
# Create the data to use in the function:
df_tri <- fmxdat::BRICSTRI
# Let's add a column of text to try and break Return_Calculator:
df_tri <- bind_cols( df_tri, Text_Column = rep( "Some Random TEXT!", nrow(df_tri)) )

# Let's now test this bugger:
Return_Calculator(df = df_tri)
## # A tibble: 790 x 7
## Date brz chn ind rus zar Text_Column
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2000-01-14 NA NA NA NA NA Some Random TEXT!
## 2 2000-01-21 -0.0267 -0.0172 0.0232 -0.0263 -0.0107 Some Random TEXT!
## 3 2000-01-28 -0.0279 -0.00169 0.0427 -0.106 -0.101 Some Random TEXT!
## 4 2000-02-04 0.0615 -0.0158 0.0561 -0.0305 0.0396 Some Random TEXT!
## 5 2000-02-11 -0.00709 -0.102 0.111 0.0404 0.00702 Some Random TEXT!
## 6 2000-02-18 -0.0315 0.0428 0.107 -0.0450 -0.0178 Some Random TEXT!
## 7 2000-02-25 0.00960 -0.105 -0.0858 -0.0261 -0.0300 Some Random TEXT!
## 8 2000-03-03 0.0392 -0.0420 -0.0256 0.114 -0.0576 Some Random TEXT!
## 9 2000-03-10 -0.0107 0.0455 -0.0248 0.185 0.0279 Some Random TEXT!
## 10 2000-03-17 -0.0289 0.00407 -0.107 -0.0339 -0.00839 Some Random TEXT!
## # ... with 780 more rows
# Winning!
Obviously the above was much more elaborate than needed (intentionally so), but it illustrates quite a few nice tricks to make your function robust.
Before proceeding, let’s build a structured framework that you must get used to following.
As stressed, the README in your project should be a diary as you progress through a project, and a manual once you are done. Be thorough in documenting your readme.
All R codes should be saved in code, with source scripts loaded as shown in the README.
VERY IMPORTANT: the code folder should only be for fully contained functions.
This means, a script like the following does not belong in code:
#----------------------------------------
# script content:
df <- read_rds("data/Some_Data.rds")

XX <- Some_Foo(df)

Some_Other_Foo(Y = df, X = rnorm(100))

write_csv(XX, "C:/Somewhere/Something.csv")
#----------------------------------------
Can you understand why the above should not be saved and called in an R script in ‘code?’
Where would this (or a similar sequential execution of code) belong? The README of course!
Sourcing the above would execute everything in the script into your environment, as it is not wrapped in a function(){ } blanket.
The following should also never be done (where there is uncommented code outside your function, as this will be executed if you source the function):
#----------------------------------------
# script content:
x <- 10

y <- 35

some_foo <- function(X){

  result <- X^2

}
#----------------------------------------
Put all your enclosed functions in the ‘code’ folder
Use fmxdat::source_all("code") to load all your functions into the README (or point to another location to load other scripts).
The source_all function will only load scripts ending in ".R". Again, make sure these files contain only enclosed functions!
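For contrast with the anti-patterns above, here is a sketch of a script that DOES belong in ‘code’ - say code/calc_returns.R (an illustrative file and function name): nothing but one fully enclosed function, so sourcing it only creates the function and nothing else lands in your environment.

```r
# code/calc_returns.R (illustrative): a fully contained function.
# Sourcing this file defines calc_returns - no stray objects, no execution.
calc_returns <- function(df, price_cols) {
  # base-R lag: shift each column down by one observation
  ret <- function(x) x / c(NA, head(x, -1)) - 1
  df[price_cols] <- lapply(df[price_cols], ret)
  df
}
```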
I want you to now practice creating your own function, while explaining your process in your README.
I want you to do the following:
In a different chunk in your README:
Load BRICS.rds, and save this as ‘df_Brics’ in your environment.
Next, create a new chunk where you source a function, ‘filter_df.R,’ that does the following:
filter your dataframe to only consider weekdays between two dates (set by default to 2008 and 2010).
Hint. use: rmsfuns::dateconverter and filter the rows using ‘which’;
For now still, use base R coding (apply family and square bracket truncation).
Create and source a function that calculates the simple returns \(\frac{X}{lag(X)}-1\) for all the columns in your dataframe.
Now that you’ve seen how to construct and source functions, let’s discuss functional programming in R a bit more.
In R, functions are first-class elements - implying you can do anything with functions that you can do with vectors: assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.
You have seen this perhaps without fully understanding that you are using functions as inputs - in the apply family.
Notice that these are called functionals, as they require a function as an input (the third argument in apply, e.g.).
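These first-class properties can be shown in a few lines (summaries and make_power are illustrative names): functions stored in a list, passed to a functional, and returned from another function.

```r
# Functions as first-class objects in R:
summaries <- list(avg = mean, biggest = max)   # stored in a list

x <- c(1, 2, 3, 4)
results <- sapply(summaries, function(f) f(x)) # passed to a functional
results              # named vector: avg = 2.5, biggest = 4

# A function that builds and returns another function:
make_power <- function(p) function(x) x ^ p
square <- make_power(2)
square(5)            # 25
```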
Before we construct a functional, let’s look at another piece of syntax: the ellipsis (…).
The ellipsis allows us to keep the door open for additional parameters to be passed on to functions that are used within our function.
As an example, suppose we create a function, Return_Calc, within which we use the PerformanceAnalytics package’s PerformanceAnalytics::Return.annualized function, which has many potential input parameters.
I can now either specify all the parameters for PerformanceAnalytics::Return.annualized inside my Return_Calc function, or simply use an ellipsis as a placeholder for specifying additional parameters.
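Before the conceptual example, a minimal runnable sketch of the mechanic (mean_wrapper is an illustrative name): any argument the wrapper does not name itself is forwarded untouched, here to base R's mean.

```r
# The ellipsis forwards unnamed-by-the-wrapper arguments to the inner call:
mean_wrapper <- function(x, ...) {
  mean(x, ...)   # ... passes e.g. na.rm or trim straight through to mean
}

v <- c(1, 2, 3, NA, 100)
mean_wrapper(v)                # NA, since na.rm defaults to FALSE
mean_wrapper(v, na.rm = TRUE)  # 26.5
```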
Conceptual Example:
pacman::p_load(PerformanceAnalytics)

dfuse <- fmxdat::BRICSTRI

Return_Calc <- function(dfuse, ...){

  # ...pretend some data wrangling happens here...
  wrangled_foo <- wrangle_foo(dfuse)
  # ...pretend some more data wrangling happens here...

  result <- PerformanceAnalytics::Return.annualized( wrangled_foo, ...)

}
# This now enables me to add any parameter to my function that will ultimately be considered
# by PerformanceAnalytics::Return.annualized inside it - e.g. the parameter `geometric = TRUE`
# below is passed to Return.annualized (despite not being explicitly specified in Return_Calc
# at the top, but indirectly by channeling the ... part to the function).
Returns <- Return_Calc(dfuse, geometric = TRUE)
Let’s construct a more elaborate functional to explain the concept further:
# Let's first create a quick (and elaborate) dataframe with a column checking if the
# rounded sum of the rows (i.e. integer) is even or uneven ( this is completely arbitrary,
# but allows us to practice our base R skills a bit...).
# Notice here I introduce 'strsplit'- which allows splitting up of character vectors.
# So we first make that value a character, split its string - consider the last digit and
# check if even, before making it a numeric again.
# Of course, you could've used e.g. X / 2 and tested that the value is rounded - but below is more elaborate for practice.
Even_Uneven_Sum <- function(Row){
  ifelse( as.numeric( last( strsplit( as.character(round(Row, 0)), "")[[1]])) %in%
            c(0,2,4,6,8,10), "Even", "Uneven")
}

df_to_Use <-
  bind_cols( fmxdat::SectorTRIs,
             tibble(Info = apply(fmxdat::SectorTRIs[,-1], 1, Even_Uneven_Sum))
  )

# Now, let's create a functional that applies a function to the rows flagged "Even" in the Info column:
Max_Foo <- function(X, Scalar_Choice){
  max( X, na.rm=T ) * Scalar_Choice
}
Min_Foo <- function(X, Number_Multiplied, Check_Positive_Neg){

  Result <- min( X, na.rm=T ) * Number_Multiplied

  if( Check_Positive_Neg ){
    Final_Result <- ifelse( Result * rnorm(1) > 0, paste0("Positive: ", Result), paste0("Negative: ", Result) )
  } else {
    Final_Result <- Result
  }

  Final_Result
}
Apply_Numerics_Function <- function(df_to_Use, Function_To_Apply, ...){

  Numeric_Cols <- colnames( df_to_Use[, sapply(df_to_Use, is.numeric)] )

  df_Applied <-
    as_tibble(
      cbind(
        df_to_Use[ which(df_to_Use$Info == "Even"), which(!colnames(df_to_Use) %in% Numeric_Cols)],
        Even_Foo_Applied = apply( df_to_Use[ which(df_to_Use$Info == "Even"), Numeric_Cols], 1, Function_To_Apply, ...)
        # Notice the use of ..., the ellipsis. This is a placeholder for function inputs.
      )
    )

  df_Applied
}
# Max function applied:
Apply_Numerics_Function(df_to_Use = df_to_Use,
                        Function_To_Apply = Max_Foo,
                        Scalar_Choice = 1000)

# Notice how the ellipsis could now contain a multitude of possible function inputs:
# Min function applied with positive / negative transform:
Apply_Numerics_Function(df_to_Use = df_to_Use,
                        Function_To_Apply = Min_Foo,
                        Number_Multiplied = 10,
                        Check_Positive_Neg = TRUE)

Apply_Numerics_Function(df_to_Use = df_to_Use,
                        Function_To_Apply = Min_Foo,
                        Number_Multiplied = 20,
                        Check_Positive_Neg = FALSE)
Although the above examples are elaborate (ask not why, but rather how please :) ) they show something really useful:
The ellipsis can be used as a placeholder for potential inputs.
Functions can be passed as arguments to other functions.
The above is how ‘apply’ (and other similar variants) work - they allow the provision of additional arguments in ellipses, and functions as inputs.
Please go through the above chunks and understand the different forms that functions could take in R.
Later sections will explore uses of functions in more advanced settings, so please be comfortable with these base examples.
We next explore the revolution in coding called tidy coding.
Suggested solution (note, you have to save each step as a separate function and source it all in a neat README. Try this yourself):
# Filter function:
filter_df <- function(StartDate = as.Date("2008-01-01"),
                      EndDate = as.Date("2010-01-01"),
                      df_Use){

  days_filter <- rmsfuns::dateconverter(StartDate = StartDate,
                                        EndDate = EndDate,
                                        Transform = "weekdays")

  df_Trimmed <- df_Use[ which( df_Use$Date %in% days_filter), ]

  df_Trimmed
}

Trimmed_df <-
  filter_df(StartDate = as.Date("2008-01-01"),
            EndDate = as.Date("2010-01-01"),
            df_Use = fmxdat::BRICSTRI)
# Returns function:
Return_Foo <- function(Trimmed_df){

  Aux_Return_Func <- function(R){
    R / lag(R) - 1
  }

  df_Returns <-
    cbind( Trimmed_df[,1],
           apply(Trimmed_df[,-1], 2, Aux_Return_Func)
    )

  df_Returns
}

Result <- Return_Foo(Trimmed_df)
From next time, the notation will be much easier… almost there!