MTH-2520 Homework 7: Factors, Lists and Data Frames

Problem 1.

Problem 1a.

Create a random vector of length 20000 from the numbers 0:500

rvector_1a <- sample(0:500, 20000, replace=TRUE)

Problem 1b.

Use the cut() function to create a factor with groups of 100. (i.e. 0 to 100, 100 to 200, etc.). Present this information in a table.

example <- cut(rvector_1a, seq(0, 500, 100)) #one way to   think about this within the sequence is (from=0, to=500, group size=100)
table(example)

example
  (0,100] (100,200] (200,300] (300,400] (400,500] 
     3968      3917      3967      4121      3993

Use tapply() to find the mean of each level in the factor

tapply(rvector_1a, example, mean)

  (0,100] (100,200] (200,300] (300,400] (400,500] 
 50.30091 150.77202 250.69977 350.28949 449.85024

Use tapply() to find the mean of each level in the factor

tapply(rvector_1a, example, median)

  (0,100] (100,200] (200,300] (300,400] (400,500] 
       50       151       251       350       450

Problem 2

Problem 2a

Create a random vector of 500 letters of the alphabet

# random vector created using the letters constant in R.
letters_2a <- sample(letters, 500, replace=TRUE)
table(letters_2a)

letters_2a
 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
23 21 17 20 25 17 20 23 21 15 17 27 22 16 17 10 19 16 19 19 22 21 16 14 18 25

Problem 2c

Convert the vector letters_2a to a factor factor_2c. Why might we want to do this?

factor_2c <- as.factor(letters_2a)

Treating a character as a factor instead let’s you perform different types of analyses on the data. Essentially you’re taking what the computer sees as qualitative data and using it as quantitative data instead. This could be useful if you were to say, assign different numerical “rankings” to words for instance. Likert type scales would be a good example of this. Being able to work with a vector full of “goods” and “very bads” as if it was 1:5 is a useful thing to be able to do. Assuming R handles these as I think they do anyway. Having characters that work as integers under the hood is incredibly useful.

Problem 2d

Use the table() command to tabulate how many of each letters is in the factor factor_2c

table(factor_2c)

factor_2c
 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
23 21 17 20 25 17 20 23 21 15 17 27 22 16 17 10 19 16 19 19 22 21 16 14 18 25

Problem 2e.

Verify the class of letters_2a and factor_2c

class(letters_2a)

[1] "character"

class(factor_2c)

[1] "factor"

Problem 3

Create a list that includes the following elements: i) The first 10 Fibonacci numbers. (For this I will be shamelessly reusing old code) ii) The factor c(“even”, “odd”) iii) The data frame micedata.txt

# I COULD just write the first ten numbers manually in a vector but this seems more fun.

Fibonacci <- function(n){
  
fib <- numeric(n)
# Create a vector of length n

fib[1] <- fib[2] <- 1
# Store F 1 and F 2 in the vector

for (i in 3:n) fib[i] = fib[i-1] + fib[i-2]

# cat("The first", n, "Fibonacci numbers are", "\n")
print(fib)}

fib_10 <- Fibonacci(10)

 [1]  1  1  2  3  5  8 13 21 34 55

fib_101 <- c(1,1,2,3,5,8,13,21,34,55) #backup in-case of bugs

mice_data <- read.delim("micedata.txt")
mice_data

   color.weight.length
1 purple     23    3.8
2 yellow     21    3.7
3    red     18    3.0
4  brown     26    3.4
5  green     25    3.4
6 purple     22    3.1
7    red     26    3.5
8 purple     19    3.2

Alright, now we have what we need to work with here.

pr3_list <- list(fib_10, as.factor(c("even", "odd")), mice_data )
pr3_list

[[1]]
 [1]  1  1  2  3  5  8 13 21 34 55

[[2]]
[1] even odd 
Levels: even odd

[[3]]
   color.weight.length
1 purple     23    3.8
2 yellow     21    3.7
3    red     18    3.0
4  brown     26    3.4
5  green     25    3.4
6 purple     22    3.1
7    red     26    3.5
8 purple     19    3.2

Problem 3b

Name the elements in the list

names(pr3_list) <- c("fibb", "even/odd", "mice")

Problem 3c

Using sapply() calculate the mean, median, sum and product of the first 10 Fibonacci numbers.

sapply(pr3_list["fibb"], summary)

         fibb
Min.     1.00
1st Qu.  2.25
Median   6.50
Mean    14.30
3rd Qu. 19.00
Max.    55.00

cat("The sum of the first 10 Fibonacci numbers is", sapply(pr3_list["fibb"], sum), "\n")

The sum of the first 10 Fibonacci numbers is 143

cat("The product of the first 10 Fibonacci numbers is", sapply(pr3_list["fibb"], prod), "\n")

The product of the first 10 Fibonacci numbers is 122522400

Problem 3d

Add the string “Fibonacci rules!” as a new element to the list.

pr3_list[[4]] <- toString("Fibonacci Rules!")

Problem 3e

Remove the factor from the list.

pr3_list[[2]] <- NULL

Problem 4

Import the built-in data set “trees” using data(trees)

data(trees)

Problem 4a.

Compute the summary statistics and standard deviation, rounded to 3 decimal places, for each column.

summary(trees)

     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00

sapply(trees, sd)

    Girth    Height    Volume 
 3.138139  6.371813 16.437846

sapply(trees, hist)

         Girth     Height    Volume   
breaks   integer,8 integer,7 integer,8
counts   integer,7 integer,6 integer,7
density  numeric,7 numeric,6 numeric,7
mids     numeric,7 numeric,6 numeric,7
xname    "X[[i]]"  "X[[i]]"  "X[[i]]" 
equidist TRUE      TRUE      TRUE

sapply(trees, hist)

         Girth     Height    Volume   
breaks   integer,8 integer,7 integer,8
counts   integer,7 integer,6 integer,7
density  numeric,7 numeric,6 numeric,7
mids     numeric,7 numeric,6 numeric,7
xname    "X[[i]]"  "X[[i]]"  "X[[i]]" 
equidist TRUE      TRUE      TRUE

Problem 5

pr5_str <- "I made myself a snowball As perfect as can be I thought I’d keep it as a pet And let it sleep with me I made it some pajamas And a pillow for its head Then last night it ran away But first It wet the bed"

pr5_str <- tolower(pr5_str)
str_count(pr5_str, c("a", "e", "i", "o", "u"))

[1] 20 19 14  5  2

str_replace_all(pr5_str, "p", "b")

[1] "i made myself a snowball as berfect as can be i thought i’d keeb it as a bet and let it sleeb with me i made it some bajamas and a billow for its head then last night it ran away but first it wet the bed"

Problem 6

Pascal’s Triangle arises from the coefficients of the binomial expansion \((x + y)^n\).

These are combinations \(\left(n \atop r\right) = C(n,r) = \text{choose}(n,r)\) in R.

Problem 6a

Create the following list

list_6 <- list(c(1), c(1, 1), c(1, 2, 1), c(1, 3, 3, 1))

list(1, choose(1, 0:1), choose(2, 0:2), choose(3, 0:3))

[[1]]
[1] 1

[[2]]
[1] 1 1

[[3]]
[1] 1 2 1

[[4]]
[1] 1 3 3 1

pascals_func <- function(n){
  #setting up a list with a 1 at the start.
  ret_val <- list()
  ret_val[1] <- 1
  # i will function as a variable and counter simultaneously! 
  i <- 1
  
  while (i <= n){
    # Set x to the list output of the choose function. This breaks if "list" is not specified. 
    x <- list(choose(i, 0:i))
    # append x to ret_val
    ret_val <- append(ret_val, x)
    # increments i up by 1 with every loop. Caps at n+1.
    i <- i + 1
    }
  return(ret_val)
}

pascals_func(5)

[[1]]
[1] 1

[[2]]
[1] 1 1

[[3]]
[1] 1 2 1

[[4]]
[1] 1 3 3 1

[[5]]
[1] 1 4 6 4 1

[[6]]
[1]  1  5 10 10  5  1

pascals_func(10)

[[1]]
[1] 1

[[2]]
[1] 1 1

[[3]]
[1] 1 2 1

[[4]]
[1] 1 3 3 1

[[5]]
[1] 1 4 6 4 1

[[6]]
[1]  1  5 10 10  5  1

[[7]]
[1]  1  6 15 20 15  6  1

[[8]]
[1]  1  7 21 35 35 21  7  1

[[9]]
[1]  1  8 28 56 70 56 28  8  1

[[10]]
 [1]   1   9  36  84 126 126  84  36   9   1

[[11]]
 [1]   1  10  45 120 210 252 210 120  45  10   1

Problem 7.

Import the built-in data set “iris”.

Problem 7a

Using the original data frame, compute the mean and standard deviation of the petal lengths.

cat("The mean of the petal length in iris is", round(mean(iris$Petal.Length), digits = 2), "\n")

The mean of the petal length in iris is 3.76

cat("The standard deviation of the petal length in iris is", round(sd(iris$Petal.Length), digits = 2), "\n")

The standard deviation of the petal length in iris is 1.77

Problem 7b

Compute the maximum petal width and maximum petal length.

cat("The max petal width is", max(iris$Petal.Width), "\n")

The max petal width is 2.5

cat("The max petal length is", max(iris$Petal.Length), "\n")

The max petal length is 6.9

Problem 7c

Extract the rows corresponding the species iris versicolor flowers and save it to a new data frame “VersicolorIris”.

VersicolorIris <- subset(iris, Species == "versicolor")

Problem 7d

Compute the mean and standard deviation of the Versicolor petal lengths

cat("The mean petal length of the versicolor iris is", mean(VersicolorIris$Petal.Length), "\n")

The mean petal length of the versicolor iris is 4.26

cat("The standard deviation of the versicolor iris is", round(sd(VersicolorIris$Petal.Length), digits = 2), "\n")

The standard deviation of the versicolor iris is 0.47

Problem 7e

Extract the Species and Petal Length columns and save to a new data frame ‘Petal-Length’.

# I'll be using select() from the tidyverse package here. It's so convenient.

data_7e <- as_tibble(iris)

Petal_Length <- data_7e %>% select(Petal.Length, Species)
unstack(Petal_Length)

   setosa versicolor virginica
1     1.4        4.7       6.0
2     1.4        4.5       5.1
3     1.3        4.9       5.9
4     1.5        4.0       5.6
5     1.4        4.6       5.8
6     1.7        4.5       6.6
7     1.4        4.7       4.5
8     1.5        3.3       6.3
9     1.4        4.6       5.8
10    1.5        3.9       6.1
11    1.5        3.5       5.1
12    1.6        4.2       5.3
13    1.4        4.0       5.5
14    1.1        4.7       5.0
15    1.2        3.6       5.1
16    1.5        4.4       5.3
17    1.3        4.5       5.5
18    1.4        4.1       6.7
19    1.7        4.5       6.9
20    1.5        3.9       5.0
21    1.7        4.8       5.7
22    1.5        4.0       4.9
23    1.0        4.9       6.7
24    1.7        4.7       4.9
25    1.9        4.3       5.7
26    1.6        4.4       6.0
27    1.6        4.8       4.8
28    1.5        5.0       4.9
29    1.4        4.5       5.6
30    1.6        3.5       5.8
31    1.6        3.8       6.1
32    1.5        3.7       6.4
33    1.5        3.9       5.6
34    1.4        5.1       5.1
35    1.5        4.5       5.6
36    1.2        4.5       6.1
37    1.3        4.7       5.6
38    1.4        4.4       5.5
39    1.3        4.1       4.8
40    1.5        4.0       5.4
41    1.3        4.4       5.6
42    1.3        4.6       5.1
43    1.3        4.0       5.1
44    1.6        3.3       5.9
45    1.9        4.2       5.7
46    1.4        4.2       5.2
47    1.6        4.2       5.0
48    1.4        4.3       5.2
49    1.5        3.0       5.4
50    1.4        4.1       5.1

Problem 8: THE BIRTHDAY PROBLEM

What is the probability that in a set of \(n\) randomly selected people at least two people have the same birthday?

If we have a group of 366 (or 367 for a leap year), there is a 100% probability that two people have the same birthday.

How many people do you need so that the probability is at least 50%? It is surprising to learn that only 23 people are needed for a 50% chance of two people with the same birthday. And if there are 60 people in the room, the probability increases to 99.41%.

The probability is given by the formula:

\[p(n) = 1 - \frac{365!}{365^n (365 - n)}\]

Problem 8a

birthday <- function(n){
  1 - prod(seq(from=1, to = (365 - n + 1) / 365, by = -1/365))
}
BdayProb <- sapply(1:70, birthday)

Problem 8b

plot(1:70, BdayProb, pch = 20, xlab = "Number of People", ylab = "Probability")

Problem 8d

Create a nicer plot. Run the following code.

library(reshape)
library(ggplot2)

# Input the value for npeople. Change this value to change the graph.
  npeople = 70

# Create a data frame Bday using the birthday function
  Bday = data.frame(n = 1:npeople, ProbTwoSame =
    sapply(1:npeople, birthday), y = 0.5)

  # Melt the data for casting (Smooths the data points)
  Bday = reshape::melt(Bday, id.vars = "n")
  
  ggplot(Bday, aes(x = n, y = value, colour = variable)) + geom_line() + scale_colour_hue("") + xlab("Number of People") + ylab("Probability")

I believe I got this working correctly. I ran into an issue where “+” was on a new line, so making it one giant gnarly line got it at least compiling.