Objectives

  • R overview
  • RStudio Environment
  • R as a calculator
  • Creating variables
  • Intro to vectors
  • Intro to dataframes
  • Intro to graphics
  • Structured text files, delimiters
  • Importing a CSV with base R
  • File paths in R and your working directory
  • Introducing the readr package
  • Reading Excel Files

Why R?

  • ‘R is a system for statistical computation and graphics.’ From the R FAQ
  • Open source (free!)
  • Works on Mac/Linux/Windows
  • Comprehensive Statistical/Analytics Platform
  • Most popular open source software for statistics (or maybe Python)
  • New things implemented quickly.

Read Chapter 2 of R Programming for Data Science for more details on the history of R.

Download and Install The R project for Statistical Computing

  • Download R from www.r-project.org
  • Install R if you have not already done so.
  • After installing R, launch it!

Read Chapter 3 of R Programming for Data Science for more details on installation.

Writing Basic R Code

Let’s write some simple code in R, and learn how to execute it.

## Basic Addition
3 + 4

## Function
log(42)

## Save the result as a variable
x1 <- 3 + 4
## Print the result
x1

Thoughts on Formatting Your Code

WhyCan’tIWriteAllofMySentencesLikeThis? ItSeemsInefficientToUseSpaces!

  • Grammar Rules for the computer
    • The R language expects very specific syntax
    • Deviation from that syntax will give errors
  • Correct Syntax and Readable Code
    • We can have correct syntax that doesn’t give errors
    • But it is a nightmare for humans to read
  • Style Guides
    • Guidelines for formatting code
    • Poorly formatted but syntactically correct code will be evaluated by a computer without issue. However, humans maintaining the code will not find the work so easy.
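
To make the point concrete, here is a small sketch of the same calculation written twice. The variable names (temps_f, temps_c_ugly, temps_c_clean) are made up for this illustration. Both versions are valid R and give identical results, but only the second is pleasant to read.

```r
# Hypothetical data: temperatures in Fahrenheit
temps_f <- c(32, 50, 212)

# Syntactically correct, but hard for a human to read
temps_c_ugly<-(temps_f-32)*5/9

# Same calculation, with spaces around operators
temps_c_clean <- (temps_f - 32) * 5 / 9

# R sees no difference between the two
identical(temps_c_ugly, temps_c_clean)
```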

R Help?

  • R has built-in documentation. Open the help page for a function by typing ? followed by its name:
?log
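
A few other base R tools for finding help, shown as a short sketch. In an interactive session, help("log") is equivalent to ?log, and ??"logarithm" searches all installed help pages; the calls below also work non-interactively.

```r
# Show the arguments a function accepts
args(log)

# List objects whose names start with "log"
log_names <- apropos("^log")
log_names

# Run the examples from the bottom of a help page
example(mean)
```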

RStudio - Fancy Front End for R

  • R is a command line program. Think of it as the analytic engine.
  • But the Graphical User Interface (GUI) for R is pretty limited…
  • RStudio is a popular GUI to use for R. Think of RStudio as a fancy steering wheel for driving R, which still uses R as the engine. A steering wheel with no engine isn’t useful…
  • The homepage for RStudio is here. Download and install RStudio if you have not already done so.
  • We will use RStudio from now on, but always remember you can use regular R without RStudio! Occasionally there may be an issue caused by RStudio. Running your code using ‘regular’ R is an option.

EX2. Math with RStudio

  • Open RStudio
  • Create a new script, save it, and put a header on it
  • Establish a reasonable file directory for this training
  • Let’s explore the options …
  • Using R as a nifty calculator
# Simple sums 
4 + 9 
## [1] 13
# products 
4 * 9 
## [1] 36

Examples

# log is the natural log (ln) 
log(1000) 
## [1] 6.907755
# Use log10 or log(x, base = 10) to get base 10 
log(1000, base = 10) 
## [1] 3
log10(1000) 
## [1] 3

EX3. More Math

Question: What is \(\log_{4}\) of 42?

log(42, base = 4) 
## [1] 2.696159

Question: What is \(\left(\frac{3 \sin\left( \frac{\pi}{6} \right) + 32}{\sqrt{23}} + 2\right) - 0.9\)?

  • Hint: sin(), pi, sqrt()
((3 * sin(pi / 6) + 32) / sqrt(23) + 2) - 0.9 
## [1] 8.085233
  • Order of Operations will be respected
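
A few one-liners illustrating how R applies the order of operations. These are generic arithmetic examples, not part of the exercise itself; the one that most often surprises people is that the exponent binds more tightly than a leading minus sign.

```r
2 + 3 * 4      # multiplication first: 14
(2 + 3) * 4    # parentheses first: 20
-2 ^ 2         # exponent before unary minus: -4
(-2) ^ 2       # 4
10 / 2 ^ 2     # exponent before division: 2.5
```

This is also why the BMI formula later in this section can be written as weight * constant / height ^ 2 without extra parentheses around height ^ 2.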

Assigning Values to Variables

  • What if we want to keep a value for later use?
  • R has five assignment operators: <-, =, ->, <<-, ->>
  • In the Console window type ?'<-'
  • Avoid <<- and ->> unless you are absolutely sure such an assignment is needed (more on this later in the course.)
  • Do not intermix left and right assignment operators
  • <- is the most commonly used assignment operator
  • Preferred by the Google R Style Guide
  • Conventional distinction between <- and =:
    • = sets the value of a function’s argument
    • <- defines an object
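
A minimal sketch of the difference, assuming a fresh session. The function square() and its argument value are hypothetical names invented for this example.

```r
a <- 5        # left assignment (most common)
10 -> b       # right assignment
total <- a + b

# Hypothetical helper used only for this illustration
square <- function(value) value ^ 2

# Here = names the argument; it does not create an object called value
sixteen <- square(value = 4)
value_exists <- exists("value", inherits = FALSE)

total         # 15
sixteen       # 16
value_exists  # FALSE in a fresh session
```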

Body Mass Index

Question: What is the BMI of someone who is 6 ft 2 inches and weighs 210 pounds?

weight   <- 210 
height   <- 74 
constant <- 703 
bmi      <- (weight * constant)/height ^ 2 

BMI

  • Where’s the answer?
  • By default, no output to the console is provided when assignments are made. You may request the output by typing the object’s name in the console.
  • Answer
bmi 
## [1] 26.95946
  • R is not verbose

BMI extended

Question: What is the BMI of this person if they lost 20 pounds?

weight <- weight - 20 
bmi 
## [1] 26.95946
# What happened? 
bmi      <- (weight * constant)/height ^ 2 
bmi 
## [1] 24.39189
  • What does this imply about R variables?
  • If an input variable changes, anything calculated from it must be recalculated; R does not update bmi automatically
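
One way to make the recalculation explicit is to wrap the formula in a function. This is only a sketch: calc_bmi() and its argument names are hypothetical, and functions are covered properly later in the course.

```r
# Hypothetical helper: BMI from weight in pounds and height in inches
calc_bmi <- function(weight_lb, height_in) {
  (weight_lb * 703) / height_in ^ 2
}

weight <- 210
height <- 74
bmi_before <- calc_bmi(weight, height)

weight <- weight - 20
bmi_after <- calc_bmi(weight, height)  # recomputed from the new weight

round(bmi_before, 5)  # 26.95946
round(bmi_after, 5)   # 24.39189
```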

What variables and functions are in my workspace?

# BMI will generate an error because R is case sensitive (the object is named bmi)
BMI
## Error in eval(expr, envir, enclos): object 'BMI' not found
# What happened? 
bmi
## [1] 24.39189
ls() 
## [1] "bmi"      "constant" "height"   "weight"

Removing Objects from the Workspace

ls() 
## [1] "bmi"      "constant" "height"   "weight"
bmi 
## [1] 24.39189
rm("bmi") 
ls()  
## [1] "constant" "height"   "weight"
# bmi  # asking for it now generates an error, because the object no longer exists

Removing All Objects from the Workspace

ls() 
## [1] "constant" "height"   "weight"
# Etch-A-Sketch End of The World!  
# Clear the whole workspace! 
rm(list = ls()) 
ls() 
## character(0)

Writing a simple script

  • Open RStudio and a new script
  • Let’s get into the habit of creating a meaningful header for scripts
  • We can work with more than a single value at a time
  • help('c')
  • Let’s look at a simple example of height and weight data
# data input 
height <- c( 60,  62,  61,  65) 
 
weight <- c(135, 155, 145, 155)   
 
sex    <- c("f", "m", "f", "f") 

Use script to find BMI

  • Find the body mass index BMI for each subject
# bmi = 703 * weight (lb) / height (inches) squared
constant <- 703 
bmi      <- (weight  * constant )/ height^2 
bmi
## [1] 26.36250 28.34677 27.39452 25.79053
  • Note the differences in the length of the vector and the constant
length(bmi) 
## [1] 4
length(constant) 
## [1] 1
  • R recycled the constant term in the vector operation
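
Recycling is not limited to single values. The sketch below uses two made-up vectors (short and long) to show what happens when the lengths differ.

```r
short <- c(10, 100)
long  <- c(1, 2, 3, 4)

# short is reused as 10, 100, 10, 100
long * short   # 10 200 30 400

# A length-3 vector does not fit evenly into length 4:
# R still recycles, but issues a warning (suppressed here)
recycled <- suppressWarnings(long + c(1, 2, 3))
recycled       # 2 4 6 5
```

Because the uneven case only produces a warning, it is worth checking vector lengths with length() when a result looks odd.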

Some simple summary statistics

There are simple summary statistics available (and many more complex ones!)

mean(weight) 
## [1] 147.5
median(height) 
## [1] 61.5
quantile(bmi) 
##       0%      25%      50%      75%     100% 
## 25.79053 26.21951 26.87851 27.63258 28.34677
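
A few more base R summaries, as a brief sketch. The vector weight_na is made up to show how missing values (NA) behave.

```r
weight <- c(135, 155, 145, 155)

summary(weight)   # min, quartiles, mean, and max in one call
sd(weight)        # sample standard deviation
range(weight)     # 135 155

# Missing values propagate unless you ask R to drop them
weight_na <- c(135, NA, 145)
mean(weight_na)                # NA
mean(weight_na, na.rm = TRUE)  # 140
```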

Object Types

  • There are many types of objects and many data structures
  • These concepts will be explored in more detail soon
mean(sex) 
## [1] NA
# find the mode of the objects 
mode(bmi) 
## [1] "numeric"
mode(sex) 
## [1] "character"
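
The call mean(sex) above returns NA (with a warning) because a mean is not defined for text. The following sketch shows how to check what kind of object you have, and what happens when types are mixed; the objects mixed and height_txt are invented for illustration.

```r
class(26.4)      # "numeric"
class("f")       # "character"
class(TRUE)      # "logical"
class(3L)        # "integer"

# Mixing types in c() coerces everything to the most general type
mixed <- c(60, "f", TRUE)
class(mixed)     # "character"

# Text that looks like a number must be converted before doing math
height_txt <- c("60", "62")
mean(as.numeric(height_txt))  # 61
```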

Subsetting / Indexing A Vector

  • What is the height of the 3rd subject?
height[3]  
## [1] 61
  • What is the bmi of the 1st, 2nd, and 3rd subjects?
bmi[c(1,2,3)] 
## [1] 26.36250 28.34677 27.39452
bmi[1:3] 
## [1] 26.36250 28.34677 27.39452
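
Some further indexing patterns, sketched on the same height vector. Negative indices drop elements rather than select them.

```r
height <- c(60, 62, 61, 65)

height[-1]              # drop the 1st element: 62 61 65
height[-c(1, 2)]        # drop the 1st and 2nd: 61 65
height[length(height)]  # last element: 65
height[10]              # past the end: NA

# Indexing also works on the left side of an assignment
height[3] <- 63
height                  # 60 62 63 65
```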

Logical Vector

  • Indexing by logical vectors is possible too
males <- sex == "m" 
males 
## [1] FALSE  TRUE FALSE FALSE
height[males] 
## [1] 62
# can combine the work into one line, e.g., height of females: 
height[sex == "f"] 
## [1] 60 61 65
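
Logical vectors are useful for more than indexing. A short sketch using the same data:

```r
height <- c( 60,  62,  61,  65)
sex    <- c("f", "m", "f", "f")

sum(sex == "f")        # TRUE counts as 1: 3 females
mean(sex == "f")       # proportion female: 0.75
which(sex == "m")      # position of the TRUE value: 2
height[!(sex == "m")]  # ! negates: 60 61 65
height[sex %in% c("f", "m") & height >= 62]  # 62 65
```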

EX4. Using logical indexing

Question: What is the bmi of males over 5’1”?

# help("&") 
idx <- sex == "m" & height > 61 
bmi[idx] 
## [1] 28.34677

Question: What is the bmi of patients that are female or under 5’3”? Hint: use | to represent “or”.

idx <- sex == "f" | height < 63 
bmi[idx] 
## [1] 26.36250 28.34677 27.39452 25.79053

Simple Scatterplot

Basic graphs are quickly available:

plot(height, weight) 

EX5. Make Simple Scatterplot

Question: Make a plot of BMI versus height

# "BMI versus height" puts BMI on the y-axis and height on the x-axis
plot(height, bmi)
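
A slightly more polished version of a BMI-by-height plot, as a sketch. The first argument to plot() goes on the x-axis and the second on the y-axis; the axis labels, title, and colour below are arbitrary choices for illustration.

```r
height <- c( 60,  62,  61,  65)
weight <- c(135, 155, 145, 155)
bmi    <- (weight * 703) / height ^ 2

plot(height, bmi,
     xlab = "Height (inches)",
     ylab = "BMI",
     main = "BMI versus height",
     pch  = 19,       # solid circles
     col  = "blue")
```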

Introduction to Data Frames

  • The primary data analysis object in R is the data frame
  • A data frame is a list of vectors (columns)
  • You may think of this like an Excel spreadsheet for now

Data Frame:

Height  Weight  Sex
70      170     m
55      150     f
66      160     m

  • We can access the columns of a data frame using the dollar sign $
  • The columns are now locked together into one object
my_df <- data.frame(weight, height) 
my_df
##   weight height
## 1    135     60
## 2    155     62
## 3    145     61
## 4    155     65
# Calculate the mean height 
mean(my_df$height) 
## [1] 62

Dataframes 2

  • You can use tab completion in RStudio following $

Question: Calculate the mean weight

# Calculate the mean weight 
mean(my_df$weight) 
## [1] 147.5

Adding a New Variable $

  • We can add a new variable by simple assignment using the $
  • Here is an example of adding gender
my_df$gender <- sex 

Question: Can you figure out how to create a BMI variable (redo the calculation using $)?

my_df$bmi <- (my_df$weight * 703)/my_df$height ^ 2 
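
After adding columns it is worth inspecting the result. The sketch below rebuilds the data frame so it runs on its own, then shows a few inspection functions and row/column subsetting with square brackets; the object name tall is made up.

```r
height <- c( 60,  62,  61,  65)
weight <- c(135, 155, 145, 155)
sex    <- c("f", "m", "f", "f")

my_df <- data.frame(weight, height)
my_df$gender <- sex
my_df$bmi    <- (my_df$weight * 703) / my_df$height ^ 2

str(my_df)     # structure: one line per column
dim(my_df)     # 4 rows, 4 columns
names(my_df)   # "weight" "height" "gender" "bmi"

# Rows go before the comma, columns after it
tall <- my_df[my_df$height > 61, c("height", "bmi")]
tall
```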

Save the script

  • Save the 01_Introduction.R file
  • Close RStudio
  • Let’s discuss the options for saving when closing R
  • Reopen RStudio and inspect the workspace
  • Rerun your script

Extending R

  • As powerful as R is, the base install doesn’t have everything
  • Additional functionality can be found in packages
  • See CRAN
  • To install a package (this only needs to be done once per R installation, not once per session):
  • When you see library(xxxx), you should first install the package xxxx
# install.packages("ggplot2")  # run once to install
# library(ggplot2)             # run in every session to load
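
A common pattern is to install a package only when it is missing, so a script can be rerun safely. The helper install_if_missing() below is a hypothetical name written for this sketch; requireNamespace() and install.packages() are the base R functions it relies on.

```r
# Hypothetical helper: install a package only when it is not yet available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is installed here
stats_ok <- install_if_missing("stats")
stats_ok

# install_if_missing("ggplot2")  # uncomment to install ggplot2 if needed
```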

Loading packages

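  • Packages must be loaded once per session with library()
# quick graph
library(ggplot2)
qplot(height, weight)  # qplot() is deprecated as of ggplot2 3.4.0; it still runs, with a warning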

Structured Text Files

A common method of storing data is to use a delimited text file. Delimited text files store data one row at a time, with each column separated by a ‘delimiter’. One example is a comma separated values file, often referred to as a CSV. In a CSV file, the delimiter is a comma. Usually the first row of the text file has the names of each column, and subsequent rows of text contain the actual data. A table of data with this information:

age  gender  salary
56   M       35k
24   M       35k
35   F       50k
42   M       25k

would look like this as a CSV file:

"age","gender","salary"
56,"M","35k"
24,"M","35k"
35,"F","50k"
42,"M","25k"

The following code creates that table as a data frame and writes it out as a CSV file:

## create data frame
df1 <- 
  data.frame(age=c(56,24,35,42),
             gender=c('M','M','F','M'),
             salary=c('35k','35k','50k','25k'))
## write out dataframe
write.csv(df1,
          file = "test1.csv",
          row.names = FALSE) # don't include row names
## Check where the file went!
getwd()
## [1] "C:/Users/evancarey/Dropbox/Work/BHAnalytics/courseware/PowerBI/PowerBI_R/PredAnalyticsWebinar/r_scripts"

Most statistical programs can output data as a csv. This is one of the most common file formats for passing data between programs (like from SQL to R, or from SAS to R, etc). The main advantage is any program can read the data. The main disadvantage is there is no ‘metadata’ indicating what each column type should be. Programs must guess, or the programmer must specify.

There are other options for delimiters you may encounter. We could use anything other than a comma. Sometimes files may be pipe delimited:

write.table(df1,
            file = "test2.csv",
            row.names = FALSE,
            sep = "|")

And the variable names may or may not be present!

When you receive a file of data, you should always first open it with a text editor (before importing it into R) and verify the delimiter, as well as whether the first row contains variable names or data.
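
If a file is too large to open comfortably in a text editor, you can also peek at the first few lines from within R without parsing them. The sketch below writes a tiny made-up pipe-delimited file to a temporary location and reads its first lines back as plain text.

```r
# Write a small pipe-delimited file to a temporary location
# (made-up data, for illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("age|gender|salary",
             "56|M|35k",
             "24|M|35k"), tmp)

# Look at the first lines as plain text, before any parsing
first_lines <- readLines(tmp, n = 2)
first_lines

# Does the first line contain a pipe? (fixed = TRUE matches "|" literally)
grepl("|", first_lines[1], fixed = TRUE)
```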

Importing Data Using Base R

  • So how do we import data from a comma delimited file into R? We can simply use the read.csv() function, and the appropriate arguments.
  • Remember to first examine the data with a text editor, then try to import into R.
  • You will need to tell R where the file is on your file system by using a file path. You cannot use single backslashes (as in normal Windows paths), because the backslash is the escape character in R strings. Blame Microsoft, this is only an issue on Windows…
  • You can either use double backslashes (which escapes the escape, so it works…), or just use forward slashes. Here is an example of what works and what doesn’t:
# Made up file paths:
# doesn't work, gives an error
"c:\users\evan\file1.txt"
## Error: '\u' used without hex digits in character string starting ""c:\u"
# this works
"c:\\users\\evan\\file1.txt"
# this works
"c:/users/evan/file1.txt"
  • \u is actually the escape sequence for a Unicode character. There are other escape sequences, like \t for tab, or \n for newline. Here I demonstrate them using the cat() function, which just sends (prints) the text to the console:
# print hello world
cat('Hello world!')
## Hello world!
# print some text with special commands
cat('Here is a tab\tdid you see it?')
## Here is a tab    did you see it?
cat('Here is a newline\ndid you see it?')
## Here is a newline
## did you see it?
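
Two further ways to avoid backslash trouble, sketched with the same made-up path: building the path from its parts with file.path(), and raw strings, which are available in R 4.0.0 and later.

```r
# Build a path from its parts; file.path() joins them with "/"
p <- file.path("c:", "users", "evan", "file1.txt")
p                 # "c:/users/evan/file1.txt"

# Raw strings (R 4.0.0 and later) take backslashes literally
raw_p <- r"(c:\users\evan\file1.txt)"
raw_p == "c:\\users\\evan\\file1.txt"   # TRUE

# Check that a file exists before trying to import it
file.exists(p)

# Relative paths are resolved against the working directory
getwd()
```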
# Import csv into R using base functions
# You must alter this path to point at wherever you have downloaded the file on your system!
df_sales <- 
  read.csv('../data/customer_sales1.csv')
# Examine the file after you import (as always!)
head(df_sales)
##   customer_id sale sale_amount transaction_Date region age gender activity
## 1      100000  Yes    309.2983       2017-07-02     33  67 Female      Med
## 2      100001   No      0.0000       2019-10-08     15  64   Male      Low
## 3      100003   No      0.0000       2018-05-07      8  63   Male      Med
## 4      100005   No      0.0000       2017-04-27     30  53   Male      Low
## 5      100006   No      0.0000       2016-06-06     46  62 Female      Med
## 6      100007  Yes    355.9759       2021-06-24     27  39   Male      Med
##   marketing_exposure       x1        x2        x3 x4 x5 num_accounts
## 1                  7 38.62002 10.806497  87.31335 11  3            0
## 2                  5 41.66519 11.187090  97.62174 12  2            1
## 3                  6 56.00695  8.179591 105.76904 10  0            0
## 4                  8 48.47624  8.107695  88.43920 12  2            0
## 5                  3 52.59204  9.640424  98.27697  8  1            0
## 6                  5 59.10474  8.553198 113.39528 15  3            1
##   current_customer income
## 1               No     58
## 2               No     75
## 3               No     44
## 4               No     54
## 5               No     64
## 6               No     62
str(df_sales)
## 'data.frame':    8000 obs. of  17 variables:
##  $ customer_id       : int  100000 100001 100003 100005 100006 100007 100008 100009 100010 100011 ...
##  $ sale              : chr  "Yes" "No" "No" "No" ...
##  $ sale_amount       : num  309 0 0 0 0 ...
##  $ transaction_Date  : chr  "2017-07-02" "2019-10-08" "2018-05-07" "2017-04-27" ...
##  $ region            : int  33 15 8 30 46 27 20 17 42 3 ...
##  $ age               : int  67 64 63 53 62 39 62 48 63 65 ...
##  $ gender            : chr  "Female" "Male" "Male" "Male" ...
##  $ activity          : chr  "Med" "Low" "Med" "Low" ...
##  $ marketing_exposure: int  7 5 6 8 3 5 4 11 1 4 ...
##  $ x1                : num  38.6 41.7 56 48.5 52.6 ...
##  $ x2                : num  10.81 11.19 8.18 8.11 9.64 ...
##  $ x3                : num  87.3 97.6 105.8 88.4 98.3 ...
##  $ x4                : int  11 12 10 12 8 15 10 7 10 16 ...
##  $ x5                : int  3 2 0 2 1 3 0 1 3 5 ...
##  $ num_accounts      : int  0 1 0 0 0 1 0 0 0 0 ...
##  $ current_customer  : chr  "No" "No" "No" "No" ...
##  $ income            : int  58 75 44 54 64 62 54 73 64 78 ...
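
In the output above, transaction_Date was read as text and customer_id as an integer, because read.csv() had to guess. If you would rather state the types yourself, the colClasses argument does that. The sketch below uses a tiny made-up file with three of the same column names, not the real sales data.

```r
# Made-up stand-in for a sales file, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("customer_id,sale,transaction_Date",
             "100000,Yes,2017-07-02",
             "100001,No,2019-10-08"), tmp)

# Name each column and the class it should be read as
df_small <- read.csv(tmp,
                     colClasses = c(customer_id      = "character",
                                    sale             = "character",
                                    transaction_Date = "Date"))
str(df_small)
```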
  • If you are importing data that has some other delimiter, you will need to use read.table() with the sep argument
  • For example, here is a file with the pipe delimiter (which you would discover when you open and examine the file with a text editor).
  • Notice this file does not have any variable names! We have to add them afterwards.
# import pipe delimited file
df2 <- 
  read.table(file='../data/chickweight.txt',
             sep = '|')
head(df2)
##   V1 V2 V3 V4
## 1 42  0  1  1
## 2 51  2  1  1
## 3 59  4  1  1
## 4 64  6  1  1
## 5 76  8  1  1
## 6 93 10  1  1
str(df2)
## 'data.frame':    578 obs. of  4 variables:
##  $ V1: int  42 51 59 64 76 93 106 125 149 171 ...
##  $ V2: int  0 2 4 6 8 10 12 14 16 18 ...
##  $ V3: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V4: int  1 1 1 1 1 1 1 1 1 1 ...
# Add names - we would need to know these from some other source. 
names(df2) <- 
  c("weight", "Time", "Chick", "Diet")
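
As an alternative to renaming afterwards, read.table() can take the names at import time through its col.names argument. The sketch below uses a three-line made-up file in the same layout rather than the real chickweight.txt.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("42|0|1|1",
             "51|2|1|1",
             "59|4|1|1"), tmp)

# header = FALSE: the first row is data, not names
df_named <- read.table(tmp,
                       sep       = "|",
                       header    = FALSE,
                       col.names = c("weight", "Time", "Chick", "Diet"))
df_named
```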

Shortcoming of the Base Import Functions

  • You can use those base functions to import data. However, R has to try and guess what each column should be (unless you specify it). Also, in versions of R before 4.0.0, the base functions converted character columns into factors by default (which some people don’t like). Finally, they can be slow on larger data.

  • For those reasons, there are a few different packages that make reading data a bit easier. We will focus on the readr package, which is part of the tidyverse collection of packages. We will spend a bit of time going over all the packages in the tidyverse group in subsequent lectures.

  • For these simple cases, we can just use read_csv() to replace read.csv(), and read_delim() to replace read.table().

  • The main differences you will notice are:

    • Character values always remain character instead of becoming factors
    • The print method for the data frame is nicer (it won’t print the whole data frame and overwhelm your screen). This is because the return object is actually a tbl (pronounced tibble), which is just a fancy version of a data frame.
# using readr
library(readr)
# Import csv into R using readr functions
# You must alter this path to point at wherever you have downloaded the file on your system!
df_sales <- 
  read_csv('../data/customer_sales1.csv')
# Examine the file after you import (as always!)
df_sales
## # A tibble: 8,000 x 17
##    customer_id sale  sale_amount transaction_Date region   age gender activity
##          <dbl> <chr>       <dbl> <date>            <dbl> <dbl> <chr>  <chr>   
##  1      100000 Yes         309.  2017-07-02           33    67 Female Med     
##  2      100001 No            0   2019-10-08           15    64 Male   Low     
##  3      100003 No            0   2018-05-07            8    63 Male   Med     
##  4      100005 No            0   2017-04-27           30    53 Male   Low     
##  5      100006 No            0   2016-06-06           46    62 Female Med     
##  6      100007 Yes         356.  2021-06-24           27    39 Male   Med     
##  7      100008 Yes         545.  2020-08-14           20    62 Female Low     
##  8      100009 Yes         238.  2021-05-29           17    48 Male   Low     
##  9      100010 No            0   2019-12-28           42    63 Female Med     
## 10      100011 Yes          21.5 2018-03-31            3    65 Male   Med     
## # ... with 7,990 more rows, and 9 more variables: marketing_exposure <dbl>,
## #   x1 <dbl>, x2 <dbl>, x3 <dbl>, x4 <dbl>, x5 <dbl>, num_accounts <dbl>,
## #   current_customer <chr>, income <dbl>
str(df_sales)
## spec_tbl_df [8,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ customer_id       : num [1:8000] 1e+05 1e+05 1e+05 1e+05 1e+05 ...
##  $ sale              : chr [1:8000] "Yes" "No" "No" "No" ...
##  $ sale_amount       : num [1:8000] 309 0 0 0 0 ...
##  $ transaction_Date  : Date[1:8000], format: "2017-07-02" "2019-10-08" ...
##  $ region            : num [1:8000] 33 15 8 30 46 27 20 17 42 3 ...
##  $ age               : num [1:8000] 67 64 63 53 62 39 62 48 63 65 ...
##  $ gender            : chr [1:8000] "Female" "Male" "Male" "Male" ...
##  $ activity          : chr [1:8000] "Med" "Low" "Med" "Low" ...
##  $ marketing_exposure: num [1:8000] 7 5 6 8 3 5 4 11 1 4 ...
##  $ x1                : num [1:8000] 38.6 41.7 56 48.5 52.6 ...
##  $ x2                : num [1:8000] 10.81 11.19 8.18 8.11 9.64 ...
##  $ x3                : num [1:8000] 87.3 97.6 105.8 88.4 98.3 ...
##  $ x4                : num [1:8000] 11 12 10 12 8 15 10 7 10 16 ...
##  $ x5                : num [1:8000] 3 2 0 2 1 3 0 1 3 5 ...
##  $ num_accounts      : num [1:8000] 0 1 0 0 0 1 0 0 0 0 ...
##  $ current_customer  : chr [1:8000] "No" "No" "No" "No" ...
##  $ income            : num [1:8000] 58 75 44 54 64 62 54 73 64 78 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   customer_id = col_double(),
##   ..   sale = col_character(),
##   ..   sale_amount = col_double(),
##   ..   transaction_Date = col_date(format = ""),
##   ..   region = col_double(),
##   ..   age = col_double(),
##   ..   gender = col_character(),
##   ..   activity = col_character(),
##   ..   marketing_exposure = col_double(),
##   ..   x1 = col_double(),
##   ..   x2 = col_double(),
##   ..   x3 = col_double(),
##   ..   x4 = col_double(),
##   ..   x5 = col_double(),
##   ..   num_accounts = col_double(),
##   ..   current_customer = col_character(),
##   ..   income = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
class(df_sales)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
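# import pipe delimited file
# Note: read_delim() treats the first row as column names by default, so in
# the output below the first data row (42, 0, 1, 1) has become the header.
# Passing col_names = FALSE (or a vector of names) avoids this.
df2 <- 
  read_delim(file='../data/chickweight.txt',
            delim = '|')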
df2
## # A tibble: 577 x 4
##     `42`   `0` `1...3` `1...4`
##    <dbl> <dbl>   <dbl>   <dbl>
##  1    51     2       1       1
##  2    59     4       1       1
##  3    64     6       1       1
##  4    76     8       1       1
##  5    93    10       1       1
##  6   106    12       1       1
##  7   125    14       1       1
##  8   149    16       1       1
##  9   171    18       1       1
## 10   199    20       1       1
## # ... with 567 more rows
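
The output above has 577 rows and column names such as `42`, because the first data row was used as the header. A minimal sketch of one fix is to supply the names through the col_names argument of read_delim(). It uses a three-line made-up file in the same layout, not the real chickweight.txt.

```r
library(readr)

tmp <- tempfile(fileext = ".txt")
writeLines(c("42|0|1|1",
             "51|2|1|1",
             "59|4|1|1"), tmp)

# Supplying col_names tells readr that the file has no header row
df_fixed <- read_delim(tmp,
                       delim     = "|",
                       col_names = c("weight", "Time", "Chick", "Diet"))
df_fixed   # 3 rows; the first row (42, 0, 1, 1) is kept as data
```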

Perform some of the data manipulations from Power BI:

  • Replace Med with Medium
  • Add a year column
  • Add the conditional income category
head(df_sales)
## # A tibble: 6 x 17
##   customer_id sale  sale_amount transaction_Date region   age gender activity
##         <dbl> <chr>       <dbl> <date>            <dbl> <dbl> <chr>  <chr>   
## 1      100000 Yes          309. 2017-07-02           33    67 Female Med     
## 2      100001 No             0  2019-10-08           15    64 Male   Low     
## 3      100003 No             0  2018-05-07            8    63 Male   Med     
## 4      100005 No             0  2017-04-27           30    53 Male   Low     
## 5      100006 No             0  2016-06-06           46    62 Female Med     
## 6      100007 Yes          356. 2021-06-24           27    39 Male   Med     
## # ... with 9 more variables: marketing_exposure <dbl>, x1 <dbl>, x2 <dbl>,
## #   x3 <dbl>, x4 <dbl>, x5 <dbl>, num_accounts <dbl>, current_customer <chr>,
## #   income <dbl>
# check unique values
unique(df_sales$activity)
## [1] "Med"  "Low"  "High"
df_sales$activity[df_sales$activity=='Med'] <- 
  'Medium'
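
After a replacement like this, it is good practice to confirm it worked. The sketch below repeats the same logical-index replacement on a small made-up vector, checks the result, and optionally stores it as an ordered factor so that the categories sort sensibly.

```r
activity <- c("Med", "Low", "High", "Med")

# Same logical-index replacement, on a small made-up vector
activity[activity == "Med"] <- "Medium"

# Confirm that no "Med" values remain and count each category
table(activity)
any(activity == "Med")   # FALSE

# Optional: store as an ordered factor so Low < Medium < High
activity_f <- factor(activity,
                     levels  = c("Low", "Medium", "High"),
                     ordered = TRUE)
activity_f < "High"      # TRUE TRUE FALSE TRUE
```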

We can add the transaction year like so:

library(lubridate)
df_sales$transaction_year <- 
  year(df_sales$transaction_Date)
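
If lubridate is not installed, base R can extract the year as well. A short sketch on two made-up dates:

```r
dates <- as.Date(c("2017-07-02", "2019-10-08"))

# format() returns text, so convert the result to integer
years <- as.integer(format(dates, "%Y"))
years   # 2017 2019
```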

And let’s explore the income category:

# examine income 
summary(df_sales$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.00   58.00   65.00   65.07   72.00  103.00
# create bins:
df_sales$income_cat <- 
  cut(df_sales$income,
      breaks = c(0,60,70,105),
      labels = c('low','medium','high'))
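
It helps to know which bin a value lands in when it sits exactly on a break. By default cut() builds intervals that are open on the left and closed on the right. The sketch below applies the same breaks and labels to a few made-up incomes chosen to sit on either side of each break.

```r
income <- c(27, 60, 61, 70, 71, 103)   # made-up values around the breaks

income_cat <- cut(income,
                  breaks = c(0, 60, 70, 105),
                  labels = c("low", "medium", "high"))

data.frame(income, income_cat)
# 60 falls in "low" and 70 in "medium", because each bin is (lower, upper]

table(income_cat)
```

Any value outside the outermost breaks (here, at or below 0 or above 105) would become NA, which is why the summary() call above is a useful first step for choosing breaks.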

Conclusion

We have covered a lot of ground in base R functionality! Does anyone have any questions about what we have covered so far?

Let’s step back into Power BI and show how to use R as a data source, building off the script we just created.