Read Chapter 2 of R Programming for Data Science for more details on the history of R.
Read Chapter 3 of R Programming for Data Science for more details on installation.
Let’s write some simple code in R, and learn how to execute it.
## Basic Addition
3 + 4
## Function
log(42)
## Save the result as a variable
x1 <- 3 + 4
## Print the result
x1
WhyCan’tIWriteAllofMySentencesLikeThis? ItSeemsInefficientToUseSpaces!
?log
# Simple sums
4 + 9
## [1] 13
# products
4 * 9
## [1] 36
# log is the natural log (ln)
log(1000)
## [1] 6.907755
# Use log10 or log(x, base = 10) to get base 10
log(1000, base = 10)
## [1] 3
log10(1000)
## [1] 3
Question: What is \(\log_{4}\) of 42?
log(42, base = 4)
## [1] 2.696159
Question: What is \(\left(\frac{3 \sin\left( \frac{\pi}{6} \right) + 32}{\sqrt{23}} + 2\right) - 0.9\)? Hint: use sin(), pi, and sqrt().
((3 * sin(pi / 6) + 32) / sqrt(23) + 2) - 0.9
## [1] 8.085233
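The base argument of log() is a shortcut for the change-of-base rule, \(\log_b(x) = \ln(x) / \ln(b)\). The short sketch below checks that both routes agree for the question above; the variable names are only illustrative.

```r
# Change of base: log_b(x) = log(x) / log(b)
x <- 42
b <- 4

via_argument <- log(x, base = b)   # let log() handle the base
via_formula  <- log(x) / log(b)    # natural logs and the change-of-base rule

# all.equal() compares numbers with a small tolerance
isTRUE(all.equal(via_argument, via_formula))
```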
R has several assignment operators: <-, =, ->, <<-, ->>
For help on them, run ?'<-'
Avoid <<- and ->> unless you are absolutely sure such an assignment is needed (more on this later in the course).
<- is the most commonly used assignment operator.
The difference between <- and =:
By convention, = defines the value of a function’s argument, while <- defines an object.
Question: What is the BMI of someone who is 6 ft 2 inches and weighs 210 pounds?
weight <- 210
height <- 74
constant <- 703
bmi <- (weight * constant)/height ^ 2
bmi
## [1] 26.95946
Question: What is the BMI of this person if they lost 20 pounds?
weight <- weight - 20
bmi
## [1] 26.95946
# What happened?
bmi <- (weight * constant)/height ^ 2
bmi
## [1] 24.39189
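What happened is that bmi stores a number, not a formula: changing weight afterwards does not recompute it. One way to make the recalculation explicit is to wrap the formula in a function. This is only a sketch; bmi_from and bmi_stale are hypothetical names that are not part of the original script.

```r
# Hypothetical helper (not from the original notes): BMI from pounds and inches
bmi_from <- function(weight_lb, height_in) {
  (weight_lb * 703) / height_in^2
}

weight <- 210
height <- 74
bmi <- bmi_from(weight, height)   # computed once, from 210 lb

weight <- weight - 20             # changes weight only
bmi_stale <- bmi                  # bmi still holds the value stored earlier

bmi <- bmi_from(weight, height)   # recompute to pick up the new weight
round(c(stale = bmi_stale, updated = bmi), 5)
```

The objects weight, height and bmi end up with the same values as in the steps above, so the rest of the script behaves the same way.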
# BMI will generate an error
BMI
## Error in eval(expr, envir, enclos): object 'BMI' not found
# What happened?
bmi
## [1] 24.39189
ls()
## [1] "bmi" "constant" "height" "weight"
bmi
## [1] 24.39189
rm("bmi")
ls()
## [1] "constant" "height" "weight"
# bmi  # asking for bmi now generates an error, because the object has been removed
ls()
## [1] "constant" "height" "weight"
# Etch-A-Sketch End of The World!
# Clear the whole workspace!
rm(list = ls())
ls()
## character(0)
help('c')
# data input
height <- c( 60, 62, 61, 65)
weight <- c(135, 155, 145, 155)
sex <- c("f", "m", "f", "f")
# bmi = weight (lb) per height (inches) squared
constant <- 703
bmi <- (weight * constant )/ height^2
bmi
## [1] 26.36250 28.34677 27.39452 25.79053
length(bmi)
## [1] 4
length(constant)
## [1] 1
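The lengths above explain why the calculation works: constant has length 1, so R recycles it across all four elements of weight. A minimal sketch of the same idea, repeating the constant by hand to show the two forms agree (recycled and explicit are illustrative names):

```r
weight <- c(135, 155, 145, 155)
height <- c(60, 62, 61, 65)

# A length-1 value is recycled to match the longer vector
recycled <- (weight * 703) / height^2

# The same calculation with the constant repeated explicitly
explicit <- (weight * rep(703, length(weight))) / height^2

identical(recycled, explicit)
```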
There are simple summary statistics available (and many more complex ones!)
mean(weight)
## [1] 147.5
median(height)
## [1] 61.5
quantile(bmi)
## 0% 25% 50% 75% 100%
## 25.79053 26.21951 26.87851 27.63258 28.34677
mean(sex)
## [1] NA
# find the mode of the objects
mode(bmi)
## [1] "numeric"
mode(sex)
## [1] "character"
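mean(sex) returns NA (with a warning) because sex is a character vector. For a categorical variable like this, counts and proportions are usually more useful. A small sketch using base R, with illustrative object names:

```r
sex <- c("f", "m", "f", "f")

# Count how many times each value occurs
counts <- table(sex)
counts

# A logical comparison is TRUE/FALSE, and mean() of a logical is a proportion
prop_female <- mean(sex == "f")
prop_female
```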
height[3]
## [1] 61
bmi[c(1,2,3)]
## [1] 26.36250 28.34677 27.39452
bmi[1:3]
## [1] 26.36250 28.34677 27.39452
males <- sex == "m"
males
## [1] FALSE TRUE FALSE FALSE
height[males]
## [1] 62
# can combine the work into one line, e.g., height of females:
height[sex == "f"]
## [1] 60 61 65
Question: What is the bmi of males over 5’1”?
# help("&")
idx <- sex == "m" & height > 61
bmi[idx]
## [1] 28.34677
Question: What is the bmi of patients that are female or under 5’3”? Hint: use | to represent “or”.
idx <- sex == "f" | height < 63
bmi[idx]
## [1] 26.36250 28.34677 27.39452 25.79053
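Logical vectors like idx are useful beyond subsetting. The sketch below, which re-enters the same small data set, shows a few related base R helpers: sum() to count matches, which() to get their positions, and any() to ask whether there is at least one match.

```r
height <- c(60, 62, 61, 65)
sex <- c("f", "m", "f", "f")

is_female <- sex == "f"

n_female <- sum(is_female)               # TRUE counts as 1, so this counts matches
female_positions <- which(is_female)     # index positions of the TRUE values
any_tall_male <- any(sex == "m" & height > 64)  # is there at least one match?

n_female
female_positions
any_tall_male
```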
Basic graphs are quickly available:
plot(height, weight)
Question: Make a plot of BMI versus height
plot(height, bmi)  # BMI on the y-axis versus height on the x-axis
Data Frame:
| Height | Weight | Sex |
|---|---|---|
| 70 | 170 | m |
| 55 | 150 | f |
| 66 | 160 | m |
my_df <- data.frame(weight, height)
my_df
## weight height
## 1 135 60
## 2 155 62
## 3 145 61
## 4 155 65
# Calculate the mean height
mean(my_df$height)
## [1] 62
Question: Calculate the mean weight
# Calculate the mean weight
mean(my_df$weight)
## [1] 147.5
# Use $ to add a new variable, e.g. gender
my_df$gender <- sex
Question: Can you figure out how to create a BMI variable (redo the calculation using $)?
my_df$bmi <- (my_df$weight * 703)/my_df$height ^ 2
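The same logical indexing used on vectors also works on the rows of a data frame. The sketch below uses a separate, hypothetical data frame called demo_df so it does not overwrite my_df; the values are the same heights and weights as above.

```r
# Hypothetical copy of the example data, kept separate from my_df
demo_df <- data.frame(weight = c(135, 155, 145, 155),
                      height = c(60, 62, 61, 65))

# Add a column with $
demo_df$bmi <- (demo_df$weight * 703) / demo_df$height^2

# Keep only the rows where height is over 61 inches (note the trailing comma)
over_61 <- demo_df[demo_df$height > 61, ]
over_61

# [[ ]] with the column name is equivalent to $
same_column <- identical(demo_df$bmi, demo_df[["bmi"]])
```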
01 _Introduction.R file
Before calling library(xxxx), you should first install the package xxxx.
# library(ggplot2)
# install.packages("ggplot2")
# quick graph
library(ggplot2)
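qplot(height, weight)  # note: qplot() is deprecated as of ggplot2 3.4.0; ggplot() + geom_point() is the current equivalent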
A common method of storing data is to use a delimited text file. Delimited text files store data one row at a time, with each column separated by a ‘delimiter’. One example is a comma separated values file, often referred to as a CSV. In a CSV file, the delimiter is a comma. Usually the first row of the text file has the names of each column, and subsequent rows of text contain the actual data. A table of data with this information:
Would look like this as a csv file:
## create data frame
df1 <-
data.frame(age=c(56,24,35,42),
gender=c('M','M','F','M'),
salary=c('35k','35k','50k','25k'))
## write out dataframe
write.csv(df1,
file = "test1.csv",
row.names = FALSE) # don't include row names
## Check where the file went!
getwd()
## [1] "C:/Users/evancarey/Dropbox/Work/BHAnalytics/courseware/PowerBI/PowerBI_R/PredAnalyticsWebinar/r_scripts"
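To see what a CSV actually looks like as text, you can write a small data frame to a temporary file and read the raw lines back. This is a minimal sketch; df_demo, tmp and csv_lines are illustrative names, and the temporary file stands in for a real file on disk.

```r
# A two-row example in the same shape as df1
df_demo <- data.frame(age = c(56, 24),
                      gender = c("M", "F"),
                      salary = c("35k", "50k"))

# Write to a temporary file so nothing is left in the working directory
tmp <- tempfile(fileext = ".csv")
write.csv(df_demo, file = tmp, row.names = FALSE)

# readLines() returns the file as plain text, one element per line
csv_lines <- readLines(tmp)
print(csv_lines)
```

The first line holds the column names and each later line holds one row, with commas between the columns. Note that write.csv() puts character values in double quotes by default.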
Most statistical programs can output data as a csv. This is one of the most common file formats for passing data between programs (like from SQL to R, or from SAS to R, etc). The main advantage is any program can read the data. The main disadvantage is there is no ‘metadata’ indicating what each column type should be. Programs must guess, or the programmer must specify.
There are other options for delimiters you may encounter. We could use anything other than a comma. Sometimes files may be pipe delimited:
write.table(df1,
file = "test2.csv",
row.names = FALSE,
sep = "|")
And the variable names may or may not be present!
When you receive a file of data, you should always first open it with a text editor (not import it into R) and verify the delimiter, as well as whether the first row contains variable names or data.
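If a text editor is not handy, a similar check can be done from R without parsing the file, by reading just the first few raw lines. The sketch below assumes a small pipe-delimited file created on the spot; in practice you would point readLines() at the file you received.

```r
# Create a small pipe-delimited file to inspect (stand-in for a real file)
tmp <- tempfile(fileext = ".txt")
writeLines(c("42|0|1|1", "51|2|1|1", "59|4|1|1"), tmp)

# Peek at the first two lines as plain text; nothing is parsed into columns
first_lines <- readLines(tmp, n = 2)
print(first_lines)

# Check for a candidate delimiter; fixed = TRUE treats "|" literally
has_pipe <- grepl("|", first_lines, fixed = TRUE)
has_pipe
```

Here the first line is clearly data rather than variable names, which tells you the file has no header row.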
Import the file with the read.csv() function, and the appropriate arguments.
# Made up file paths:
# doesn't work, gives the error shown
"c:\users\evan\file1.txt"
## Error: '\u' used without hex digits in character string starting ""c:\u"
# this works
"c:\\users\\evan\\file1.txt"
# this works
"c:/users/evan/file1.txt"
\u is actually the start of a Unicode escape sequence, which is why the single-backslash path fails. There are other escape sequences, like \t for tab, or \n for newline. Here I demonstrate them using the cat() function, which just sends (prints) the text to the console:
# print hello world
cat('Hello world!')
## Hello world!
# print some text with special commands
cat('Here is a tab\tdid you see it?')
## Here is a tab did you see it?
cat('Here is a newline\ndid you see it?')
## Here is a newline
## did you see it?
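Two other ways to avoid the backslash problem are worth knowing. file.path() builds a path from its pieces using forward slashes, and raw strings (available in R 4.0.0 and later) turn off escape processing. The sketch below uses the same made-up path as above.

```r
# Build the path from pieces; file.path() joins them with "/"
p1 <- file.path("c:", "users", "evan", "file1.txt")
p1

# Escaped backslashes and a raw string (R >= 4.0.0) describe the same text
p2 <- "c:\\users\\evan\\file1.txt"
p3 <- r"(c:\users\evan\file1.txt)"
identical(p2, p3)

# An escape sequence such as \t is a single character
nchar("a\tb")
```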
# Import csv into R using base functions
# You must alter this path to point at wherever you have downloaded the file on your system!
df_sales <-
read.csv('../data/customer_sales1.csv')
# Examine the file after you import (as always!)
head(df_sales)
## customer_id sale sale_amount transaction_Date region age gender activity
## 1 100000 Yes 309.2983 2017-07-02 33 67 Female Med
## 2 100001 No 0.0000 2019-10-08 15 64 Male Low
## 3 100003 No 0.0000 2018-05-07 8 63 Male Med
## 4 100005 No 0.0000 2017-04-27 30 53 Male Low
## 5 100006 No 0.0000 2016-06-06 46 62 Female Med
## 6 100007 Yes 355.9759 2021-06-24 27 39 Male Med
## marketing_exposure x1 x2 x3 x4 x5 num_accounts
## 1 7 38.62002 10.806497 87.31335 11 3 0
## 2 5 41.66519 11.187090 97.62174 12 2 1
## 3 6 56.00695 8.179591 105.76904 10 0 0
## 4 8 48.47624 8.107695 88.43920 12 2 0
## 5 3 52.59204 9.640424 98.27697 8 1 0
## 6 5 59.10474 8.553198 113.39528 15 3 1
## current_customer income
## 1 No 58
## 2 No 75
## 3 No 44
## 4 No 54
## 5 No 64
## 6 No 62
str(df_sales)
## 'data.frame': 8000 obs. of 17 variables:
## $ customer_id : int 100000 100001 100003 100005 100006 100007 100008 100009 100010 100011 ...
## $ sale : chr "Yes" "No" "No" "No" ...
## $ sale_amount : num 309 0 0 0 0 ...
## $ transaction_Date : chr "2017-07-02" "2019-10-08" "2018-05-07" "2017-04-27" ...
## $ region : int 33 15 8 30 46 27 20 17 42 3 ...
## $ age : int 67 64 63 53 62 39 62 48 63 65 ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ activity : chr "Med" "Low" "Med" "Low" ...
## $ marketing_exposure: int 7 5 6 8 3 5 4 11 1 4 ...
## $ x1 : num 38.6 41.7 56 48.5 52.6 ...
## $ x2 : num 10.81 11.19 8.18 8.11 9.64 ...
## $ x3 : num 87.3 97.6 105.8 88.4 98.3 ...
## $ x4 : int 11 12 10 12 8 15 10 7 10 16 ...
## $ x5 : int 3 2 0 2 1 3 0 1 3 5 ...
## $ num_accounts : int 0 1 0 0 0 1 0 0 0 0 ...
## $ current_customer : chr "No" "No" "No" "No" ...
## $ income : int 58 75 44 54 64 62 54 73 64 78 ...
Other delimiters can be read with read.table().
# import pipe delimited file
df2 <-
read.table(file='../data/chickweight.txt',
sep = '|')
head(df2)
## V1 V2 V3 V4
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
str(df2)
## 'data.frame': 578 obs. of 4 variables:
## $ V1: int 42 51 59 64 76 93 106 125 149 171 ...
## $ V2: int 0 2 4 6 8 10 12 14 16 18 ...
## $ V3: int 1 1 1 1 1 1 1 1 1 1 ...
## $ V4: int 1 1 1 1 1 1 1 1 1 1 ...
# Add names - we would need to know these from some other source.
names(df2) <-
c("weight", "Time", "Chick", "Diet")
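The names can also be supplied at import time through the col.names argument of read.table(). The sketch below reads a few pipe-delimited rows from a string (via the text argument) instead of a file, so it runs on its own; pipe_text and df_named are illustrative names.

```r
# Three rows in the same layout as the chick weight file, with no header row
pipe_text <- "42|0|1|1\n51|2|1|1\n59|4|1|1"

# header = FALSE says the first row is data; col.names supplies the names
df_named <- read.table(text = pipe_text,
                       sep = "|",
                       header = FALSE,
                       col.names = c("weight", "Time", "Chick", "Diet"))
df_named
```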
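You can use those base functions to import data. However, R has to try to guess what each column should be (unless you specify it). Also, before R 4.0.0, these functions converted character columns into factors by default (which some people don’t like); since R 4.0.0 the default is stringsAsFactors = FALSE, which is why the character columns above show as chr. Finally, they can be slow on larger data.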
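If you would rather not let R guess, the colClasses argument of read.csv() lets you state the type of each column. The sketch below is only an illustration: it reads two rows typed in as a string, using four of the column names from the sales file, and the choice of types is an assumption about how you might want them stored.

```r
# A tiny stand-in for the first columns of the sales file
csv_text <- "customer_id,sale,sale_amount,transaction_Date
100000,Yes,309.2983,2017-07-02
100001,No,0,2019-10-08"

# One class per column, in order: keep the id as text, parse the date as a Date
df_typed <- read.csv(text = csv_text,
                     colClasses = c("character", "character", "numeric", "Date"))
str(df_typed)
```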
For those reasons, there are a few different packages that make reading data a bit easier. We will focus on the readr package, which is part of the tidyverse collection of packages. We will spend a bit of time going over all the packages in the tidyverse group in subsequent lectures.
For these simple cases, we can just use read_csv() to replace read.csv(), and read_delim() to replace read.table().
The main difference you will notice is that the imported object is a tbl (pronounced tibble), which is just a fancy version of a data frame.
# using readr
library(readr)
# Import csv into R using readr functions
# You must alter this path to point at wherever you have downloaded the file on your system!
df_sales <-
read_csv('../data/customer_sales1.csv')
# Examine the file after you import (as always!)
df_sales
## # A tibble: 8,000 x 17
## customer_id sale sale_amount transaction_Date region age gender activity
## <dbl> <chr> <dbl> <date> <dbl> <dbl> <chr> <chr>
## 1 100000 Yes 309. 2017-07-02 33 67 Female Med
## 2 100001 No 0 2019-10-08 15 64 Male Low
## 3 100003 No 0 2018-05-07 8 63 Male Med
## 4 100005 No 0 2017-04-27 30 53 Male Low
## 5 100006 No 0 2016-06-06 46 62 Female Med
## 6 100007 Yes 356. 2021-06-24 27 39 Male Med
## 7 100008 Yes 545. 2020-08-14 20 62 Female Low
## 8 100009 Yes 238. 2021-05-29 17 48 Male Low
## 9 100010 No 0 2019-12-28 42 63 Female Med
## 10 100011 Yes 21.5 2018-03-31 3 65 Male Med
## # ... with 7,990 more rows, and 9 more variables: marketing_exposure <dbl>,
## # x1 <dbl>, x2 <dbl>, x3 <dbl>, x4 <dbl>, x5 <dbl>, num_accounts <dbl>,
## # current_customer <chr>, income <dbl>
str(df_sales)
## spec_tbl_df [8,000 x 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customer_id : num [1:8000] 1e+05 1e+05 1e+05 1e+05 1e+05 ...
## $ sale : chr [1:8000] "Yes" "No" "No" "No" ...
## $ sale_amount : num [1:8000] 309 0 0 0 0 ...
## $ transaction_Date : Date[1:8000], format: "2017-07-02" "2019-10-08" ...
## $ region : num [1:8000] 33 15 8 30 46 27 20 17 42 3 ...
## $ age : num [1:8000] 67 64 63 53 62 39 62 48 63 65 ...
## $ gender : chr [1:8000] "Female" "Male" "Male" "Male" ...
## $ activity : chr [1:8000] "Med" "Low" "Med" "Low" ...
## $ marketing_exposure: num [1:8000] 7 5 6 8 3 5 4 11 1 4 ...
## $ x1 : num [1:8000] 38.6 41.7 56 48.5 52.6 ...
## $ x2 : num [1:8000] 10.81 11.19 8.18 8.11 9.64 ...
## $ x3 : num [1:8000] 87.3 97.6 105.8 88.4 98.3 ...
## $ x4 : num [1:8000] 11 12 10 12 8 15 10 7 10 16 ...
## $ x5 : num [1:8000] 3 2 0 2 1 3 0 1 3 5 ...
## $ num_accounts : num [1:8000] 0 1 0 0 0 1 0 0 0 0 ...
## $ current_customer : chr [1:8000] "No" "No" "No" "No" ...
## $ income : num [1:8000] 58 75 44 54 64 62 54 73 64 78 ...
## - attr(*, "spec")=
## .. cols(
## .. customer_id = col_double(),
## .. sale = col_character(),
## .. sale_amount = col_double(),
## .. transaction_Date = col_date(format = ""),
## .. region = col_double(),
## .. age = col_double(),
## .. gender = col_character(),
## .. activity = col_character(),
## .. marketing_exposure = col_double(),
## .. x1 = col_double(),
## .. x2 = col_double(),
## .. x3 = col_double(),
## .. x4 = col_double(),
## .. x5 = col_double(),
## .. num_accounts = col_double(),
## .. current_customer = col_character(),
## .. income = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
class(df_sales)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
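# import pipe delimited file
# note: this file has no header row, so without the col_names argument
# read_delim() uses the first data row as the column names (see the output)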
df2 <-
read_delim(file='../data/chickweight.txt',
delim = '|')
df2
## # A tibble: 577 x 4
## `42` `0` `1...3` `1...4`
## <dbl> <dbl> <dbl> <dbl>
## 1 51 2 1 1
## 2 59 4 1 1
## 3 64 6 1 1
## 4 76 8 1 1
## 5 93 10 1 1
## 6 106 12 1 1
## 7 125 14 1 1
## 8 149 16 1 1
## 9 171 18 1 1
## 10 199 20 1 1
## # ... with 567 more rows
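Notice that the tibble above has 577 rows and column names like `42`, while read.table() gave 578 rows: read_delim() assumes the first row holds column names. Passing col_names avoids losing that row. The sketch below is a minimal illustration on a three-row temporary file; it assumes the readr package is installed and skips the import otherwise, and the column names are the ones used earlier for this file.

```r
# A small pipe-delimited file with no header row (stand-in for chickweight.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("42|0|1|1", "51|2|1|1", "59|4|1|1"), tmp)

if (requireNamespace("readr", quietly = TRUE)) {
  # col_names supplies the names, so the first row is kept as data
  chick <- readr::read_delim(tmp,
                             delim = "|",
                             col_names = c("weight", "Time", "Chick", "Diet"))
  print(chick)
}
```

Using col_names = FALSE instead would also keep the first row, with default column names.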
head(df_sales)
## # A tibble: 6 x 17
## customer_id sale sale_amount transaction_Date region age gender activity
## <dbl> <chr> <dbl> <date> <dbl> <dbl> <chr> <chr>
## 1 100000 Yes 309. 2017-07-02 33 67 Female Med
## 2 100001 No 0 2019-10-08 15 64 Male Low
## 3 100003 No 0 2018-05-07 8 63 Male Med
## 4 100005 No 0 2017-04-27 30 53 Male Low
## 5 100006 No 0 2016-06-06 46 62 Female Med
## 6 100007 Yes 356. 2021-06-24 27 39 Male Med
## # ... with 9 more variables: marketing_exposure <dbl>, x1 <dbl>, x2 <dbl>,
## # x3 <dbl>, x4 <dbl>, x5 <dbl>, num_accounts <dbl>, current_customer <chr>,
## # income <dbl>
# check unique values
unique(df_sales$activity)
## [1] "Med" "Low" "High"
df_sales$activity[df_sales$activity=='Med'] <-
'Medium'
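After recoding, it can help to store an ordered category like this as a factor with the levels in a sensible order, so that tables and plots show Low, Medium, High rather than alphabetical order. A small sketch on a made-up vector (activity and activity_f are illustrative names, not columns of df_sales):

```r
activity <- c("Med", "Low", "High", "Med")

# Same recode as above, applied to a plain vector
activity[activity == "Med"] <- "Medium"

# Set the level order explicitly instead of relying on alphabetical order
activity_f <- factor(activity, levels = c("Low", "Medium", "High"))
table(activity_f)
```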
We can add the transaction year like so:
library(lubridate)
df_sales$transaction_year <-
year(df_sales$transaction_Date)
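If lubridate is not available, base R can extract the year from a Date with format(). This is a sketch on two example dates rather than on df_sales; dates and yr are illustrative names.

```r
dates <- as.Date(c("2017-07-02", "2019-10-08"))

# format() returns text such as "2017", so convert it to an integer
yr <- as.integer(format(dates, "%Y"))
yr
```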
And let’s explore the income category:
# examine income
summary(df_sales$income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.00 58.00 65.00 65.07 72.00 103.00
# create bins:
df_sales$income_cat <-
cut(df_sales$income,
breaks = c(0,60,70,105),
labels = c('low','medium','high'))
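By default, cut() makes intervals that are open on the left and closed on the right, so with these breaks an income of exactly 60 falls in 'low' and exactly 70 falls in 'medium'. The sketch below checks this on a few made-up values placed at the boundaries.

```r
income <- c(27, 60, 61, 70, 103)

# Intervals are (0, 60], (60, 70], (70, 105] by default (right = TRUE)
income_cat <- cut(income,
                  breaks = c(0, 60, 70, 105),
                  labels = c("low", "medium", "high"))

data.frame(income, income_cat)
```

Values outside the breaks (for example, an income above 105) would become NA, so it is worth checking the range with summary() first, as done above.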
We have covered a lot of base R functionality! Does anyone have any questions about what we have covered so far?
Let’s step back into PowerBI and show how to use R as a data source, building off the script we just created.