Introduction to R

Workshop Materials and Introduction

Materials and setup

You should have R installed –if not:

Download workshop materials:

What is R?

R is a programming language designed for statistical computing. Notable characteristics include:

  • Vast capabilities, wide range of statistical and graphical techniques
  • Very popular in academia, growing popularity in business: http://r4stats.com/articles/popularity/
  • Written primarily by statisticians
  • FREE (no cost, open source)
  • Excellent community support: mailing list, blogs, tutorials
  • Easy to extend by writing new functions

InspiRation

OK, it's free and popular, but what makes R worth learning? In a word, "packages". If you have a data manipulation, analysis or visualization task, chances are good that there is an R package for that. For example:

  • Want to find out where we are?
library(ggmap)
nwbuilding <- geocode("1737 Cambridge Street Cambridge, MA 02138", source = "google") 
ggmap(get_map("Cambridge, MA", zoom = 15)) +
  geom_point(data=nwbuilding, size = 7, shape = 13, color = "red")

hereweare.png

  • Want to forecast the population of Australia?
library(forecast)
fit <- auto.arima(austres)
## Projected numbers (in thousands) of Australian residents
plot(forecast(fit))

austop.png

  • Want to interactively explore the shape of the Churyumov–Gerasimenko comet?
library(plotly)
comet <- rgl::readOBJ(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726"))

comet.plot <- plot_ly(x = comet$vb[1,],
                      y = comet$vb[2,],
                      z = comet$vb[3,],
                      i = comet$it[1,]-1,
                      j= comet$it[2,]-1,
                      k = comet$it[3,]-1,
                      type = "mesh3d")

setwd("images")
htmlwidgets::saveWidget(comet.plot, file = "comet.html")
setwd("..")

comet.plot

Whatever you're trying to do, you're probably not the first to try doing it in R. Chances are good that someone has already written a package for that.

Graphical User Interfaces (GUIs)

R GUI alternatives

The old-school way is to run R directly in a terminal

Rconsole.png

But hardly anybody does it that way anymore! The Windows version of R comes with a GUI that looks like this:

Rgui.png

The default Windows GUI is not very good

  • No parentheses matching or syntax highlighting
  • No work-space browser

RStudio (an alternative GUI for R) is shown below.

Rstudio.png

RStudio has many useful features, including parentheses matching and auto-completion. RStudio is not the only advanced R interface; other alternatives include Emacs with ESS (shown below).

emacs.png

Emacs + ESS is a very powerful combination, but can be difficult to set up.

Jupyter.png

Jupyter is a notebook interface that runs in your web browser. A lot of people like it. You can access these workshop notes as a Jupyter notebook at http://tutorials-live.iq.harvard.edu:8000/notebooks/workshops/R/Rintro/Rintro.ipynb

Launch RStudio

  • Open the RStudio program
  • Open up today's R script
    • In RStudio, Go to File => Open Script
    • Locate and open the Rintro.R script in the Rintro folder on your desktop
  • Go to Tools => Set working directory => To source file location (more on the working directory later)
  • I encourage you to add your own notes to this file! Every line that starts with # is a comment that will be ignored by R. My comments all start with ##; you can add your own, possibly using # or ### to distinguish your comments from mine.

Exercise 0

The purpose of this exercise is mostly to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you've decided to use). You may not know how to do these things; that's fine! This is an opportunity to learn. If you don't know how to do something you can use internet search engines, search on StackOverflow, or ask the person next to you.

Also keep in mind that we are living in a golden age of tab completion. If you don't know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!

  1. Try to get R to add 2 plus 2.
  2. Try to calculate the square root of 10.
  3. There is an R package named car (Companion to Applied Regression). Try to install this package.
  4. R includes extensive documentation, including a file named "An introduction to R". Try to find this help file.

Data and Functions

Assignment

Values can be assigned names and used in subsequent operations

  • The <- operator (less than followed by a dash) is used to save values
  • The name on the left gets the value on the right.
x <- 10 # Assign the value 10 to a variable named x
x + 1 # Add 1 to x
x # note that x is unchanged
y <- x + 1 # Assign y the value x + 1
y
x <- x + 100 # change the value of x
y ## note that y is unchanged.
> x <- 10 # Assign the value 10 to a variable named x
> x + 1 # Add 1 to x
[1] 11
> x # note that x is unchanged
[1] 10
> y <- x + 1 # Assign y the value x + 1
> y
[1] 11
> x <- x + 100 # change the value of x
> y ## note that y is unchanged.
[1] 11
>

Saved variables can be listed, overwritten and deleted

ls() # List variables in workspace
x # Print the value of x
x <- 100 # Overwrite x. Note that no warning is given!
x
rm(x) # Delete x
ls()
> ls() # List variables in workspace
[1] "comet"      "comet.plot" "filter"     "fit"        "nwbuilding"
[6] "x"          "y"         
> x # Print the value of x
[1] 110
> x <- 100 # Overwrite x. Note that no warning is given!
> x
[1] 100
> rm(x) # Delete x
> ls()
[1] "comet"      "comet.plot" "filter"     "fit"        "nwbuilding"
[6] "y"         
>

Data types and conversion

The x and y data objects we created are numeric vectors of length one. Vectors are the simplest data structure in R, and are the building blocks used to make more complex data structures. Here are some more vector examples.

x <- c(10, 11, 12)
X <- c("10", "11", "12")
y <- c("h", "e", "l", "l", "o")
Y <- "hello"
z <- c(1, 0, 1, 1)
Z <- c(TRUE, FALSE, TRUE, TRUE)
> x <- c(10, 11, 12)
> X <- c("10", "11", "12")
> y <- c("h", "e", "l", "l", "o")
> Y <- "hello"
> z <- c(1, 0, 1, 1)
> Z <- c(TRUE, FALSE, TRUE, TRUE)
>

Notice that the c function combines its arguments into a vector.

All R objects have a mode and length. Since it is impossible for an object not to have these attributes they are called intrinsic attributes.

print(x)
mode(x)
length(x)

print(X)
mode(X)
length(X)

length(y)
length(Y)

mode(z)
mode(Z)
> print(x)
[1] 10 11 12
> mode(x)
[1] "numeric"
> length(x)
[1] 3
> 
> print(X)
[1] "10" "11" "12"
> mode(X)
[1] "character"
> length(X)
[1] 3
> 
> length(y)
[1] 5
> length(Y)
[1] 1
> 
> mode(z)
[1] "numeric"
> mode(Z)
[1] "logical"
>

Data structures in R can be converted from one type to another using one of the many functions beginning with as.. For example:

mode(x)
mode(as.character(x))
mode(X)
mode(as.numeric(X))
> mode(x)
[1] "numeric"
> mode(as.character(x))
[1] "character"
> mode(X)
[1] "character"
> mode(as.numeric(X))
[1] "numeric"
>
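Conversion does not always succeed. The sketch below (the vector names are made up for illustration) shows what as.numeric does with text that does not look like a number, and how the logical values TRUE and FALSE map to 1 and 0.

```r
## hypothetical example values, not part of the workshop data
mixed <- c("10", "eleven", "12")

## text that cannot be parsed as a number becomes NA (R also issues a warning,
## which is silenced here)
mixed.num <- suppressWarnings(as.numeric(mixed))
mixed.num
is.na(mixed.num)

## logical values convert to 1 (TRUE) and 0 (FALSE), and back again
flags <- c(TRUE, FALSE, TRUE, TRUE)
as.numeric(flags)
as.logical(c(1, 0, 1, 1))
```

Checking for NA values after a conversion is a cheap way to catch entries that were not what you expected.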

Functions

Using R is mostly about applying functions to variables. Functions

  • take variable(s) as input argument(s)
  • perform operations
  • return values which can be assigned
  • optionally perform side-effects such as writing a file to disk or opening a graphics window

The general form for calling R functions is

## FunctionName(arg.1, arg.2, ..., arg.n)

Arguments can be matched by position or name

Examples:

  #?sqrt
  z <- c(10, 11, 12)
  a <- sqrt(z) # Call the sqrt function with argument x=z

## look at the arguments to the round function
  args(round) # use ?round if you need more information

  round(a, digits = 2) # Call round() with arguments x=a and digits=2

  ## since matching by name takes precedence these are all equivalent:
  round(a, 2)
  round(x = a, 2)
  round(digits = 2, x = a)

  ## the only way we can go wrong is by omitting the names and mixing up the order
  round(2, z)

  # Functions can be nested so an alternative is
  round(sqrt(z), digits = 2) # Take sqrt of z and round
>   #?sqrt
>   z <- c(10, 11, 12)
>   a <- sqrt(z) # Call the sqrt function with argument x=z
>   
> ## look at the arguments to the round function
>   args(round) # use ?round if you need more information
function (x, digits = 0) 
NULL
> 
>   round(a, digits = 2) # Call round() with arguments x=a and digits=2
[1] 3.16 3.32 3.46
> 
>   ## since matching by name takes precedence these are all equivalent:
>   round(a, 2)
[1] 3.16 3.32 3.46
>   round(x = a, 2)
[1] 3.16 3.32 3.46
>   round(digits = 2, x = a)
[1] 3.16 3.32 3.46
> 
>   ## the only way we can go wrong is by omitting the names and mixing up the order
>   round(2, z)
[1] 2 2 2
> 
>   # Functions can be nested so an alternative is
>   round(sqrt(z), digits = 2) # Take sqrt of z and round
[1] 3.16 3.32 3.46
>
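You can also write your own functions. The sketch below is illustrative only: the name rounded.root and its arguments are invented here, and the function simply wraps the sqrt and round calls used above.

```r
## rounded.root is a made-up name used only for this illustration
rounded.root <- function(x, digits = 2) {
  ## take the square root, then round to the requested number of digits;
  ## the value of the last expression is what the function returns
  round(sqrt(x), digits = digits)
}

z <- c(10, 11, 12)
rounded.root(z)                  # digits falls back to its default of 2
rounded.root(z, digits = 0)      # override the default by name
rounded.root(digits = 1, x = z)  # named arguments can come in any order
```

Giving digits a default value means callers only have to supply it when they want something other than two decimal places.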

Asking R for help

R has extensive built-in documentation that can be accessed through R commands or through the GUI.

## Start html help, search/browse using web browser
help.start() # or use the help menu from your GUI
## Look up the documentation for a function
help(plot) ## or use the shortcut: ?plot
## Look up documentation for a package
help(package="stats")
## Search documentation from R (not always the best way... google often works better)
help.search("classification")
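Two more helpers that ship with R can be useful when you only half-remember a function name or want to see a function in action. A small sketch:

```r
## list objects whose names match a pattern (a regular expression)
matches <- apropos("^read\\.csv")
matches

## run the examples from a help page; here, the examples for mean
example(mean)
```

apropos() searches the names of objects on the search path, so it only finds functions from packages that are currently attached.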

R packages

There are thousands of R packages that extend R's capabilities. Some packages are distributed with R, and some of these are attached to the search path by default. Many more are available in package repositories.

## To see what packages are attached to the search path:
search()

## To view installed packages:
library()

## To load a package: 
library("MASS")

## Install new package: 
install.packages("stringdist")
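Installing a package every time a script runs is slow and needs a network connection. A common pattern, sketched here with the stringdist package from the example above, is to install only when the package is missing.

```r
## requireNamespace() returns TRUE if the package is installed and FALSE
## otherwise, without attaching it to the search path
have.stringdist <- requireNamespace("stringdist", quietly = TRUE)
have.stringdist

## install only if needed (left commented out so that nothing is downloaded
## unless you choose to run it)
# if (!have.stringdist) install.packages("stringdist")

## packages that are installed can be attached with library()
if (have.stringdist) library("stringdist")
```

The variable name have.stringdist is made up for this example; the same check works for any package name.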

In this workshop we will use the tidyverse package. tidyverse is a meta package that loads the dplyr package for easier data manipulation, the readr package for easier data import/export, and several other useful packages. See https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/

Exercise 1

The purpose of this exercise is to practice using the package management and help facilities.

  1. Use the search function to inspect the current search path. Assign the result to the name orig.search.path.
  2. What are the mode and length of orig.search.path?
  3. Install the tidyverse package. Compare the output of search() to the value you've saved in orig.search.path. Has it changed?
  4. Use the library function to attach the tidyverse package. Compare the output of search() to the value you've saved in orig.search.path. Has it changed?
  5. Look up the help page for the readr package. Which function would you use to read a tab delimited file?

Getting data into R

The baby names data set

The examples in this workshop use the baby names data provided by the governments of the United States and the United Kingdom. A cleaned and merged version of these data is in dataSets/babyNames.csv.

The "working directory" and listing files

R knows the directory it was started in, and refers to this as the "working directory". Since our workshop examples are in the Rintro folder on your desktop, we should all take a moment to set that as our working directory.

getwd() # what is my current working directory?
# setwd("~/Desktop/Rintro") # change directory

Note that "~" means "my home directory" but that this can mean different things on different operating systems. You can also use the Files tab in RStudio to navigate to a directory, then click "More -> Set as working directory".

We can also set the working directory using paths relative to the current working directory. Once we are in the "Rintro" folder we can navigate to the "dataSets" folder like this:

getwd() # get the current working directory
setwd("dataSets") # set wd to the dataSets folder
getwd()
setwd("..") # set wd to enclosing folder ("up")
> getwd() # get the current working directory
[1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro"
> setwd("dataSets") # set wd to the dataSets folder
> getwd()
[1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro/dataSets"
> setwd("..") # set wd to enclosing folder ("up")
>

It can be convenient to list files in a directory without leaving R

list.files("dataSets") # list files in the dataSets folder
> list.files("dataSets") # list files in the dataSets folder
[1] "babyNames.csv"
>
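Instead of changing the working directory back and forth, you can build paths with file.path() and check them with file.exists(). The sketch below uses a temporary folder and made-up file names so that it runs anywhere; substitute your own folders.

```r
## create a scratch folder and an empty file inside it
## (demo.dir and demo.csv are names invented for this example)
demo.dir <- file.path(tempdir(), "demo.dir")
dir.create(demo.dir, showWarnings = FALSE)
demo.file <- file.path(demo.dir, "demo.csv")
file.create(demo.file)

file.exists(demo.file)                     # is the file where we think it is?
list.files(demo.dir)                       # list everything in the folder
list.files(demo.dir, pattern = "\\.csv$")  # list only .csv files
basename(demo.file)                        # file name without the folders
```

file.path() inserts the path separator for you, which avoids typing it by hand.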

Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists the functions that can import data from common file formats.

Data type                 Function        Package
comma separated (.csv)    read_csv()      readr (tidyverse)
other delimited formats   read_delim()    readr (tidyverse)
R (.Rds)                  read_rds()      readr (tidyverse)
Stata (.dta)              read_stata()    haven (tidyverse, needs to be attached separately)
SPSS (.sav)               read_spss()     haven (tidyverse, needs to be attached separately)
SAS (.sas7bdat)           read_sas()      haven (tidyverse, needs to be attached separately)
Excel (.xls, .xlsx)       read_excel()    readxl (tidyverse, needs to be attached separately)

Exercise 2

The purpose of this exercise is to practice reading data into R. The data in "dataSets/babyNames.csv" is moderately tricky to read, making it a good data set to practice on.

  1. Open the help page for the read_csv function. How can you limit the number of rows to be read in?
  2. Read just the first 10 rows of "dataSets/babyNames.csv". Notice that the "Sex" column has been read as a logical (TRUE/FALSE).
  3. Read the read_csv help page to figure out how to make it read the "Sex" column as a character. Make adjustments to your code until you have read in the first 10 rows with the correct column types. "Year" and "Name.length" should be integer (int), "Count" and "Percent" should be double (dbl) and everything else should be character (chr).
  4. Once you have successfully read in the first 10 rows, read the whole file, assigning the result to the name baby.names.

Checking imported data

It is always a good idea to examine the imported data set–usually we want the results to be a data.frame

## we know that this object will have mode and length, because all R objects do.
mode(baby.names)
length(baby.names) # number of columns

## additional information about this data object
class(baby.names) # check to see that baby.names is a data.frame
dim(baby.names) # how many rows and columns?
names(baby.names) # or colnames(baby.names)
str(baby.names) # more details
glimpse(baby.names) # details, more compactly
> ## we know that this object will have mode and length, because all R objects do.
> mode(baby.names)
[1] "list"
> length(baby.names) # number of columns
[1] 7
> 
> ## additional information about this data object
> class(baby.names) # check to see that baby.names is a data.frame
[1] "tbl_df"     "tbl"        "data.frame"
> dim(baby.names) # how many rows and columns?
[1] 1966001       7
> names(baby.names) # or colnames(baby.names)
[1] "Location"    "Year"        "Sex"         "Name"       
[5] "Count"       "Percent"     "Name.length"
> str(baby.names) # more details
Classes 'tbl_df', 'tbl' and 'data.frame':	1966001 obs. of  7 variables:
 $ Location   : chr  "England and Wales" "England and Wales" "England and Wales" "England and Wales" ...
 $ Year       : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
 $ Sex        : chr  "F" "F" "F" "F" ...
 $ Name       : chr  "sophie" "chloe" "jessica" "emily" ...
 $ Count      : num  7087 6824 6711 6415 6299 ...
 $ Percent    : num  2.39 2.31 2.27 2.17 2.13 ...
 $ Name.length: int  6 5 7 5 6 6 9 7 3 5 ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 7
  .. ..$ Location   : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ Year       : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Sex        : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ Name       : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ Count      : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Percent    : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Name.length: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"
> glimpse(baby.names) # details, more compactly
Observations: 1,966,001
Variables: 7
$ Location    <chr> "England and Wales", "England and Wales", "En...
$ Year        <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 199...
$ Sex         <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", ...
$ Name        <chr> "sophie", "chloe", "jessica", "emily", "laure...
$ Count       <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 582...
$ Percent     <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2...
$ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, ...
>
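The same checks work on any data.frame. If you want to try them without the baby names file, here is a sketch using mtcars, a small data set that ships with R (it is a stand-in chosen for this example, not part of the workshop data).

```r
mode(mtcars)         # a data.frame is stored as a list
length(mtcars)       # number of columns
dim(mtcars)          # rows, then columns
nrow(mtcars)         # just the rows
head(mtcars, 3)      # first three rows
summary(mtcars$mpg)  # quick numeric summary of one column
```

Looking at the first few rows with head() is often the fastest way to spot a column that was read with the wrong type.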

Data Manipulation

data.frame objects

Usually data read into R will be stored as a data.frame

  • A data.frame is a list of vectors of equal length
    • Each vector in the list forms a column
    • Each column can be a different type of vector
    • Typically columns are variables and the rows are observations

A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order)
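To make the "list of vectors" idea concrete, the following sketch builds a tiny data.frame by hand (the column names and values are invented) and inspects it.

```r
## three vectors of equal length, each a different type
small.df <- data.frame(name   = c("ann", "bob", "cy"),
                       age    = c(31, 45, 27),
                       member = c(TRUE, FALSE, TRUE))

is.list(small.df)        # a data.frame really is a list
length(small.df)         # one list element per column
sapply(small.df, class)  # each column has its own type
dim(small.df)            # rows first, then columns
small.df[["age"]]        # extract one column as a plain vector
```

Because every column must have the same length, data.frame() complains if you give it vectors whose lengths cannot be matched up.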

Slice and filter data.frame rows

You can extract subsets of data.frames using slice to select rows by number and filter to select rows that match some condition. It works like this:

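## make up some example data
## (stringsAsFactors = TRUE keeps id and var2 as factors in R 4.0 and later,
## which the summaries and plots later in these notes rely on)
(example.df <- data.frame(id  = rep(letters[1:4], each = 4),
                          t   = rep(1:4, times = 4),
                          var1 = runif(16),
                          var2 = sample(letters[1:3], 16, replace = TRUE),
                          stringsAsFactors = TRUE))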

## rows 2 and 4
slice(example.df, c(2, 4))

## rows where id == "a"
filter(example.df, id == "a")

## rows where id is either "a" or "b"
filter(example.df, id %in% c("a", "b"))
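> ## make up some example data
> ## (stringsAsFactors = TRUE keeps id and var2 as factors in R 4.0 and later,
> ## which the summaries and plots later in these notes rely on)
> (example.df <- data.frame(id  = rep(letters[1:4], each = 4),
+                           t   = rep(1:4, times = 4),
+                           var1 = runif(16),
+                           var2 = sample(letters[1:3], 16, replace = TRUE),
+                           stringsAsFactors = TRUE))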
   id t       var1 var2
1   a 1 0.19158254    c
2   a 2 0.46921828    c
3   a 3 0.71092483    c
4   a 4 0.81892913    c
5   b 1 0.17894751    b
6   b 2 0.94742350    c
7   b 3 0.24051714    c
8   b 4 0.41236477    c
9   c 1 0.66573374    c
10  c 2 0.72515137    b
11  c 3 0.66923824    b
12  c 4 0.41101666    b
13  d 1 0.06401198    c
14  d 2 0.34580213    b
15  d 3 0.44477036    c
16  d 4 0.12253790    b
> 
> ## rows 2 and 4
> slice(example.df, c(2, 4))
  id t      var1 var2
1  a 2 0.4692183    c
2  a 4 0.8189291    c
> 
> ## rows where id == "a"
> filter(example.df, id == "a")
  id t      var1 var2
1  a 1 0.1915825    c
2  a 2 0.4692183    c
3  a 3 0.7109248    c
4  a 4 0.8189291    c
> 
> ## rows where id is either "a" or "b"
> filter(example.df, id %in% c("a", "b"))
  id t      var1 var2
1  a 1 0.1915825    c
2  a 2 0.4692183    c
3  a 3 0.7109248    c
4  a 4 0.8189291    c
5  b 1 0.1789475    b
6  b 2 0.9474235    c
7  b 3 0.2405171    c
8  b 4 0.4123648    c
>

Select data.frame columns

slice and filter are used to extract rows. select is used to extract columns

select(example.df, id, var1)
select(example.df, id, t, var1)
> select(example.df, id, var1)
   id       var1
1   a 0.19158254
2   a 0.46921828
3   a 0.71092483
4   a 0.81892913
5   b 0.17894751
6   b 0.94742350
7   b 0.24051714
8   b 0.41236477
9   c 0.66573374
10  c 0.72515137
11  c 0.66923824
12  c 0.41101666
13  d 0.06401198
14  d 0.34580213
15  d 0.44477036
16  d 0.12253790
> select(example.df, id, t, var1)
   id t       var1
1   a 1 0.19158254
2   a 2 0.46921828
3   a 3 0.71092483
4   a 4 0.81892913
5   b 1 0.17894751
6   b 2 0.94742350
7   b 3 0.24051714
8   b 4 0.41236477
9   c 1 0.66573374
10  c 2 0.72515137
11  c 3 0.66923824
12  c 4 0.41101666
13  d 1 0.06401198
14  d 2 0.34580213
15  d 3 0.44477036
16  d 4 0.12253790
>

You can also conveniently select a single column using $, like this:

example.df$t
> example.df$t
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
>

Data manipulation commands can be combined:

filter(select(example.df,
              id,
              var1),
       id == "a")
> filter(select(example.df,
+               id,
+               var1),
+        id == "a")
  id      var1
1  a 0.1915825
2  a 0.4692183
3  a 0.7109248
4  a 0.8189291
>
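Nested calls are read from the inside out, which gets hard to follow as steps are added. The dplyr package (attached by tidyverse) also provides the %>% pipe operator, which passes the result on its left as the first argument of the function on its right. The sketch below assumes dplyr is installed and uses a small made-up data.frame rather than example.df.

```r
library(dplyr)

## made-up data for illustration
pipe.df <- data.frame(id   = c("a", "a", "b", "b"),
                      var1 = c(0.1, 0.2, 0.3, 0.4),
                      var2 = c("x", "y", "x", "y"))

## nested form: select columns, then filter rows
nested <- filter(select(pipe.df, id, var1), id == "a")

## piped form: the same steps, written in the order they happen
piped <- pipe.df %>%
  select(id, var1) %>%
  filter(id == "a")

identical(nested, piped)  # both forms give the same result
```

Which form to use is a matter of taste; the rest of these notes keep the nested form.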

In the previous example we used == to filter rows where id was "a". Other relational and logical operators are listed below.

Operator   Meaning
==         equal to
!=         not equal to
>          greater than
>=         greater than or equal to
<          less than
<=         less than or equal to
%in%       contained in
&          and
|          or
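Each of these operators returns a vector of TRUE and FALSE values, one per element, and filter() keeps the rows where the result is TRUE. A sketch on a plain vector (values invented for this example):

```r
v <- c(1, 5, 10, 15)

v > 5              # greater than
v >= 5             # greater than or equal to
v != 10            # not equal to
v %in% c(1, 15)    # contained in
v > 1 & v < 15     # and: both conditions must hold
v < 5 | v > 10     # or: at least one condition must hold

v[v > 1 & v < 15]  # use the logical vector to pick elements
```

Trying a condition on a single column first, for example example.df$var1 > 0.5, is a quick way to check it before passing it to filter().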

Adding, removing, and modifying data.frame columns

You can modify data.frames using the mutate() function. It works like this:

example.df

## modify example.df and assign the modified data.frame the name example.df
example.df <- mutate(example.df,
       var2 = var1/t, # replace the values in var2
       var3 = 1:length(t), # create a new column named var3
       var4 = factor(letters[t]),
       t = NULL # delete the column named t
       )

## examine our changes
example.df
> example.df
   id t       var1 var2
1   a 1 0.19158254    c
2   a 2 0.46921828    c
3   a 3 0.71092483    c
4   a 4 0.81892913    c
5   b 1 0.17894751    b
6   b 2 0.94742350    c
7   b 3 0.24051714    c
8   b 4 0.41236477    c
9   c 1 0.66573374    c
10  c 2 0.72515137    b
11  c 3 0.66923824    b
12  c 4 0.41101666    b
13  d 1 0.06401198    c
14  d 2 0.34580213    b
15  d 3 0.44477036    c
16  d 4 0.12253790    b
> 
> ## modify example.df and assign the modified data.frame the name example.df
> example.df <- mutate(example.df,
+        var2 = var1/t, # replace the values in var2
+        var3 = 1:length(t), # create a new column named var3
+        var4 = factor(letters[t]),
+        t = NULL # delete the column named t
+        )
> 
> ## examine our changes
> example.df
   id       var1       var2 var3 var4
1   a 0.19158254 0.19158254    1    a
2   a 0.46921828 0.23460914    2    b
3   a 0.71092483 0.23697494    3    c
4   a 0.81892913 0.20473228    4    d
5   b 0.17894751 0.17894751    5    a
6   b 0.94742350 0.47371175    6    b
7   b 0.24051714 0.08017238    7    c
8   b 0.41236477 0.10309119    8    d
9   c 0.66573374 0.66573374    9    a
10  c 0.72515137 0.36257569   10    b
11  c 0.66923824 0.22307941   11    c
12  c 0.41101666 0.10275416   12    d
13  d 0.06401198 0.06401198   13    a
14  d 0.34580213 0.17290107   14    b
15  d 0.44477036 0.14825679   15    c
16  d 0.12253790 0.03063448   16    d
>

Exporting Data

Now that we have made some changes to our data set, we might want to save those changes to a file.

# write data to a .csv file
# (the output file is passed by position because the name of this
# argument differs between readr versions)
write_csv(example.df, "example.csv")

# write data to an R file
write_rds(example.df, "example.rds")

# write data to a Stata file
library(haven)
write_dta(example.df, path = "example.dta")

Saving and loading R workspaces

In addition to importing individual datasets, R can save and load entire workspaces

ls() # list objects in our workspace
save.image(file="myWorkspace.RData") # save workspace 
rm(list=ls()) # remove all objects from our workspace 
ls() # list stored objects to make sure they are deleted
> ls() # list objects in our workspace
 [1] "a"                "baby.names"       "comet"           
 [4] "comet.plot"       "example.df"       "filter"          
 [7] "fit"              "nwbuilding"       "orig.search.path"
[10] "x"                "X"                "y"               
[13] "Y"                "z"                "Z"               
> save.image(file="myWorkspace.RData") # save workspace 
> rm(list=ls()) # remove all objects from our workspace 
> ls() # list stored objects to make sure they are deleted
character(0)
>

Load the "myWorkspace.RData" file and check that it is restored

load("myWorkspace.RData") # load myWorkspace.RData
ls() # list objects
> load("myWorkspace.RData") # load myWorkspace.RData
> ls() # list objects
 [1] "a"                "baby.names"       "comet"           
 [4] "comet.plot"       "example.df"       "filter"          
 [7] "fit"              "nwbuilding"       "orig.search.path"
[10] "x"                "X"                "y"               
[13] "Y"                "z"                "Z"               
>
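save.image() stores everything. If you only want a few objects, the base R function save() takes their names. The sketch below uses invented object names and a temporary file, and loads the result into a separate environment so that nothing in your workspace is overwritten.

```r
## two objects to keep (names invented for this example)
keep.me <- c(1, 2, 3)
me.too  <- "hello"

tmp <- tempfile(fileext = ".RData")
save(keep.me, me.too, file = tmp)  # save just these two objects

restored <- new.env()              # an empty container for the loaded objects
load(tmp, envir = restored)
ls(restored)                       # names of the objects that were saved
restored$keep.me
```

Calling load() without the envir argument restores the objects directly into your workspace, silently replacing any existing objects with the same names.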

Exercise 3: Data manipulation

Read in the "babyNames.csv" file if you have not already done so, assigning the result to baby.names.

  1. Filter baby.names to show only names given to at least 3 percent of boys.
  2. Create a column named "Proportion" equal to Percent divided by 100.
  3. Filter baby.names to include only names given to at least 3 percent of girls. Save this to a Stata data set named "popularGirlNames.dta".

Statistics by grouping variable(s)

The summarize function can be used to calculate statistics by grouping variable. Here is how it works.

summarize(group_by(example.df, id), mean(var1), sd(var1))
> summarize(group_by(example.df, id), mean(var1), sd(var1))
# A tibble: 4 × 3
      id `mean(var1)` `sd(var1)`
  <fctr>        <dbl>      <dbl>
1      a    0.5476637  0.2787990
2      b    0.4448132  0.3493286
3      c    0.6177850  0.1405077
4      d    0.2442806  0.1805739
>

You can group by multiple variables:

summarize(group_by(example.df, id, var3), mean(var1), sd(var1))
> summarize(group_by(example.df, id, var3), mean(var1), sd(var1))
Source: local data frame [16 x 4]
Groups: id [?]

       id  var3 `mean(var1)` `sd(var1)`
   <fctr> <int>        <dbl>      <dbl>
1       a     1   0.19158254         NA
2       a     2   0.46921828         NA
3       a     3   0.71092483         NA
4       a     4   0.81892913         NA
5       b     5   0.17894751         NA
6       b     6   0.94742350         NA
7       b     7   0.24051714         NA
8       b     8   0.41236477         NA
9       c     9   0.66573374         NA
10      c    10   0.72515137         NA
11      c    11   0.66923824         NA
12      c    12   0.41101666         NA
13      d    13   0.06401198         NA
14      d    14   0.34580213         NA
15      d    15   0.44477036         NA
16      d    16   0.12253790         NA
>
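The sd column above is NA because each combination of id and var3 contains a single row, and the standard deviation of one value is undefined. The sketch below (assuming dplyr is installed; the data are made up) also shows how to give the summary columns readable names and how to count rows per group with n().

```r
library(dplyr)

## made-up data: two groups of three rows each
grp.df <- data.frame(id   = rep(c("a", "b"), each = 3),
                     var1 = c(1, 2, 3, 10, 20, 30))

by.id <- summarize(group_by(grp.df, id),
                   n.rows    = n(),         # number of rows in each group
                   mean.var1 = mean(var1),  # named summary columns
                   sd.var1   = sd(var1))
by.id

sd(5)  # the standard deviation of a single value is NA
```

Including n() in a summary makes it easy to spot groups that are too small for the statistic you are computing.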

Save R output to a file

Earlier we learned how to write a data set to a file. But what if we want to write something that isn't in a nice rectangular format, like the output of summary? For that we can use the sink() function:

sink(file="output.txt", split=TRUE) # start logging
print("This is the summary of example.df \n")
print(summary(example.df))
sink() ## sink with no arguments turns logging off
> sink(file="output.txt", split=TRUE) # start logging
> print("This is the summary of example.df \n")
[1] "This is the summary of example.df \n"
> print(summary(example.df))
 id         var1              var2              var3       var4 
 a:4   Min.   :0.06401   Min.   :0.03063   Min.   : 1.00   a:4  
 b:4   1st Qu.:0.22828   1st Qu.:0.10301   1st Qu.: 4.75   b:4  
 c:4   Median :0.42857   Median :0.18527   Median : 8.50   c:4  
 d:4   Mean   :0.46364   Mean   :0.21711   Mean   : 8.50   d:4  
       3rd Qu.:0.67966   3rd Qu.:0.23520   3rd Qu.:12.25        
       Max.   :0.94742   Max.   :0.66573   Max.   :16.00        
> sink() ## sink with no arguments turns logging off
>
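To confirm what was written, you can read the log back with readLines(). The following sketch uses a temporary file and a built-in data set (both stand-ins chosen for this example) and writes the heading with cat(), which, unlike print(), turns \n into an actual line break.

```r
log.file <- tempfile(fileext = ".txt")

sink(file = log.file)  # start logging (no split, so nothing is echoed)
cat("This is the summary of mtcars$mpg\n")
print(summary(mtcars$mpg))
sink()                 # stop logging

logged <- readLines(log.file)  # one element per line of the file
logged[1]
length(logged)
```

If a script stops with an error while a sink is active, output keeps going to the file until sink() is called, so it is worth running sink() again if the console seems silent.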

Exercise 4

  1. Calculate the total number of children born.
  2. Filter the data to extract only Massachusetts (Location "MA"), and calculate the total number of children born in Massachusetts.
  3. Group and summarize the data to calculate the number of children born each year.
  4. Calculate the average number of characters in baby names (using the "Name.length" column).
  5. Group and summarize to calculate the average number of characters in baby names for each location.

Basic graphics: Frequency bars

Thanks to classes and methods, you can plot() many kinds of objects:

plot(example.df$var4)

examplePlot1.png

Basic graphics: Boxplots by group

When the first column is a factor and the second is numeric, plot() draws one boxplot per group:

plot(select(example.df, id, var1))

examplePlot2.png

Basic graphics: Mosaic chart

When both columns are factors, plot() draws a mosaic-style chart:

plot(select(example.df, id, var4))

examplePlot3.png

Basic graphics: scatter plot

plot(select(example.df, var1, var2))

examplePlot4.png
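plot() chooses what to draw from the class of its input. The sketch below (values invented; output goes to a temporary PDF so that it also runs without a display) shows the same function drawing bars for a factor, points for two numeric vectors, and boxplots for a factor paired with a numeric vector.

```r
f <- factor(c("a", "b", "b", "c", "c", "c"))  # a factor
x <- c(1, 2, 3, 4, 5)                         # numeric vectors
y <- c(2, 4, 5, 4, 6)

class(f)
class(x)

plot.file <- tempfile(fileext = ".pdf")
pdf(plot.file)   # send plots to a file instead of the screen
plot(f)          # factor: frequency bars
plot(x, y)       # two numeric vectors: scatter plot
plot(f[1:5], y)  # factor and numeric: boxplots by group
dev.off()        # close the file

file.exists(plot.file)
```

Leaving out the pdf() and dev.off() lines draws the same plots in the graphics window of your GUI.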

Wrap-up

Help us make this workshop better!

Additional resources

These workshop notes by Harvard University are licensed under a Creative Commons License. Presented by Data Science Services at IQSS.