Simulation and Computation

Setup Stuff

Let’s get started!

Listing 1: Loading Packages and R Setup
Code
library(tidyverse)
library(scidesignR)
library(knitr)
library(kableExtra)
library(countrycode)
library(skimr)

set.seed(4240)

Exercise 2.3

Note: Exercise Instructions

In Section 2.3, 20 patients for Example 2.5 were randomly selected from a population of 25,000.

(a) Simple Random Sample Probability

Note: Exercise Instructions

Explain why the probability of choosing a random sample of 20 from 25,000 is \[ \frac{1}{\binom{25000}{20}} \]

Let’s start with a smaller population of size 6, {A, B, C, D, E, F}, and a smaller sample size of 1. We can pick one of {A}, {B}, {C}, {D}, {E}, or {F}. That is, there are 6 ways to have a sample of size 1, so the probability of choosing any particular random sample of size 1 from a population of size 6 is 1/6.

Let’s expand to a sample size of 2. We know that once an individual is added to our sample, they can’t be picked again, and that the order in which individuals are added does not matter. Then, these are all the possible ways that we could pick them: {A, B} {A, C} {A, D} {A, E} {A, F}

{B, C} {B, D} {B, E} {B, F}

{C, D} {C, E} {C, F}

{D, E} {D, F}

{E, F}

That is, there are 15 ways to pick a sample of size 2. The probability of choosing a random sample of size 2 from a population of size 6 is 1/15.

We can recognize that the number of ways to pick a sample of size r from a population of size n is the binomial coefficient \(\binom{n}{r} = \frac{n!}{r!(n-r)!}\) (i.e., “n choose r”).
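As a quick sanity check (not part of the exercise itself), R’s built-in `choose()` reproduces the count we enumerated by hand:

```r
# Number of ways to pick a sample of size 2 from a population of size 6;
# this should match the 15 subsets listed above.
choose(6, 2)
#> [1] 15

# Probability of choosing any one particular sample of size 2
1 / choose(6, 2)
```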

(b) Simple Random Sample Probability with R

Note: Exercise Instructions

Use R to randomly select 4 patients from a total of 8 patients, and calculate the probability of choosing this random sample.

Explored with R

Listing 2: Random Sample Probability
Code
# Input Parameters
populationSize <- 8
sampleSize <- 4

# Population Table
patients <- tibble(id = 1:populationSize)
# Pick a sample
chosenRandomSample <- patients |>
    sample_n(sampleSize)

# How many random samples are possible?
numberOfRandomSamples <- choose(populationSize, sampleSize)

# Probability of choosing this particular random sample
probabilityOfChosenRandomSample <- (1 / numberOfRandomSamples)

# Results
probabilityOfChosenRandomSampleRounded <- round(probabilityOfChosenRandomSample, 3)
probabilityOfChosenRandomSampleRounded

Visualized with R

Code
patients |> 
    mutate(
        "Selected for Sample" = if_else(id %in% chosenRandomSample$id, "TRUE", "FALSE")
    ) |> 
    kbl() |> 
    kable_styling()
Table 1: Randomly Generated Sample
id Selected for Sample
1 FALSE
2 FALSE
3 TRUE
4 FALSE
5 TRUE
6 TRUE
7 TRUE
8 FALSE

Therefore, the probability of choosing this random sample is \(\frac{1}{70} \approx 0.014\).

Exercise 2.4

Note: Exercise Instructions

The treatment assignment displayed in Table 2.5 is one treatment assignment out of \({20 \choose 10}\) possible random treatment assignments.

(a) All Possible Treatment Assignments

Note: Exercise Instructions

R can be used to generate all possible combinations of 10 patients assigned to the treatment group using combn(covid_sample$id, 10). Use the command to generate the combinations and confirm that there are \({20 \choose 10}\) ways to generate the treatment assignments.

Code
# Variables
covid_patientPopulationSize <- 25000
covid_treatmentSampleSize <- 20
covid_activeTreatmentSampleSize <- 10 # the number of patients to be assigned to the active treatment (i.e., no placebo)

#################################
# The population of Covid Patients
covid_patientPopulation <- tibble(id = 1:covid_patientPopulationSize)

# The sample of covid patients that we will treat
covid_treatmentSamplePatients <- covid_patientPopulation |> 
    sample_n(covid_treatmentSampleSize)

# Of the sample of patients we will treat, every possible combination of those that will receive the active treatment
allPossibleActiveTreatmentSamplePatients <- covid_treatmentSamplePatients$id |> 
    combn(covid_activeTreatmentSampleSize) |> 
    as_tibble(.name_repair = "minimal") # .name_repair avoids the as_tibble.matrix() column-name warning
expectedActiveTreatmentSampleCombinationsSize <- choose(covid_treatmentSampleSize, covid_activeTreatmentSampleSize)


###########################
covid_treatmentSamplePatients |> 
    kbl(caption="COVID Patients to Receive Treatment Sample") |> 
    kable_styling()
COVID Patients to Receive Treatment Sample
id
16715
17290
20264
14181
10921
4883
16803
19022
16250
18137
1475
9472
15340
8489
13396
20749
17983
14717
15458
18309
Code
ncol(allPossibleActiveTreatmentSamplePatients)
[1] 184756
Code
expectedActiveTreatmentSampleCombinationsSize
[1] 184756

(b) Treatment Counts Under Complete Randomization

Note: Exercise Instructions

To generate the random array of treatment assignments in Table 2.5, R was used to randomly select 10 patients for the active treatment first and then assigned the placebo to the remaining 10. We can also generate a random treatment assignment vector by randomly selecting a column from the matrix of all possible treatment assignments created in part a. In R, randomly select a column and assign the first 10 to the active treatment and the rest to the placebo. Repeat a few times and discuss whether the procedure results in random assignments as desired. (Hint: You can use if_else(condition, true, false) inside mutate to return true when condition is met and false otherwise.)

Code
# Pick a random number between 1 and the total number of combinations to choose the patients that will get the active treatment
randomColumnNumberFrom_allPossibleActiveTreatmentSamplePatients <- sample(1:ncol(allPossibleActiveTreatmentSamplePatients),1)
# Get all the patient IDs that will get the active treatment
covid_activeTreatmentSamplePatients <- allPossibleActiveTreatmentSamplePatients |> 
    pull(randomColumnNumberFrom_allPossibleActiveTreatmentSamplePatients)


########################################
display <- covid_treatmentSamplePatients |> 
    mutate(
        treatmentAssignment = if_else(id %in% covid_activeTreatmentSamplePatients, "Active", "Placebo")
    ) |> 
    arrange(treatmentAssignment, id)

index <- auto_index(display$treatmentAssignment)

display |> 
    kbl() |> 
    pack_rows(index = index) |> 
    kable_styling()
id treatmentAssignment
Active
13396 Active
14181 Active
15340 Active
15458 Active
16250 Active
16803 Active
17983 Active
19022 Active
20264 Active
20749 Active
Placebo
1475 Placebo
4883 Placebo
8489 Placebo
9472 Placebo
10921 Placebo
14717 Placebo
16715 Placebo
17290 Placebo
18137 Placebo
18309 Placebo

Our first step draws a simple random sample of 20 patients from the population to receive treatment. From this sample, we generate all \({20 \choose 10} = 184{,}756\) possible ways of choosing 10 patients. Picking a random number from 1 to 184,756 and giving those 10 patients the active treatment is still a simple random sample: we are always guaranteed exactly 20 randomly selected patients from the population, exactly 10 of whom are randomly selected to receive the active treatment. This procedure does result in random assignments as desired.
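A quick simulation (a sketch, not required by the exercise) supports this: repeatedly drawing a random column from the matrix of all \({20 \choose 10}\) assignments places each of the 20 patients in the active-treatment group about half of the time.

```r
set.seed(2025)

combos <- combn(1:20, 10)   # all 184,756 possible active-treatment groups
draws  <- 5000
inTRT  <- integer(20)       # times each patient lands in the active group

for (i in seq_len(draws)) {
    chosen <- combos[, sample(ncol(combos), 1)]
    inTRT[chosen] <- inTRT[chosen] + 1
}

round(inTRT / draws, 2)     # each proportion should be near 10/20 = 0.5
```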

(c) Comparison to Independent Assignment

Note: Exercise Instructions

Another scheme we may consider is to assign TRT or PLA to each patient with equal probabilities independently. Implement the procedure in R. Repeat a few times and discuss whether the procedure results in random assignments as desired.

Code
# "Equal Probability Assuming Independence"
equalProbability <- sample(c("TRT","PLA"), size = covid_treatmentSampleSize, replace = TRUE)

# Assign TRT or PLA to each patient independently
covid_treatmentSamplePatients |> 
    mutate(
        treatmentAssignment = sample(c("TRT","PLA"), size = covid_treatmentSampleSize, replace = TRUE)
    ) |> 
    select(treatmentAssignment) |> 
    group_by(treatmentAssignment) |> 
    summarize(n = n()) |> 
    kbl() |> 
    kable_styling()
treatmentAssignment n
PLA 9
TRT 11
Code
## Run the independent-assignment procedure 100 times and record the distribution of TRT counts; plotted below.
df <- data.frame(id = 1:100, trialTRT = NA)
for (trial in 1:100) {
    currentTrial <- covid_treatmentSamplePatients |> 
    mutate(
        treatmentAssignment = sample(c("TRT","PLA"), size = covid_treatmentSampleSize, replace = TRUE)
    ) |> 
    select(treatmentAssignment) |> 
    group_by(treatmentAssignment) |> 
    summarize(n = n()) 

    df$trialTRT[trial] <- currentTrial$n[currentTrial$treatmentAssignment=="TRT"]
}

dict <- data.frame(
    TRT = 0:20,
    n = 0
)
for (i in 1:100) {
    currentValue <- df[i,]$trialTRT
    dict[dict$TRT==currentValue,]$n <- dict[dict$TRT==currentValue,]$n + 1
}

df |> 
    group_by(TRT = trialTRT) |> 
    select(TRT) |> 
    summarise(n = n())
# A tibble: 13 × 2
     TRT     n
   <int> <int>
 1     4     1
 2     5     1
 3     6     3
 4     7     8
 5     8    11
 6     9    20
 7    10    13
 8    11    16
 9    12    11
10    13     7
11    14     7
12    15     1
13    16     1
Code
dict |> 
    ggplot(aes(
        x = factor(TRT),
        y = n
        )) +
    geom_col() +
    geom_vline(xintercept = 11) # with factor levels 0:20, position 11 marks TRT = 10

Across 100 trials, the number of TRT assignments ranged from 4 to 16: centered near 10, but rarely exactly 10. The procedure is random, but it does not guarantee the desired balanced 10/10 split between active treatment and placebo.

(d) Alternatives

Note: Exercise Instructions

Can you think of another procedure in R for randomly assigning 10 patients to active treatment? Implement your own procedure in R to verify the procedure results in random assignments.

Code
covid_treatmentSamplePatients |> 
    arrange(id) |> ### NOTE: sorting by id makes the assignment deterministic, so it is not random.
    mutate(treatmentAssignment = if_else(row_number(id) <= covid_activeTreatmentSampleSize, "TRT", "PLA"))
# A tibble: 20 × 2
      id treatmentAssignment
   <int> <chr>              
 1  1475 TRT                
 2  4883 TRT                
 3  8489 TRT                
 4  9472 TRT                
 5 10921 TRT                
 6 13396 TRT                
 7 14181 TRT                
 8 14717 TRT                
 9 15340 TRT                
10 15458 TRT                
11 16250 PLA                
12 16715 PLA                
13 16803 PLA                
14 17290 PLA                
15 17983 PLA                
16 18137 PLA                
17 18309 PLA                
18 19022 PLA                
19 20264 PLA                
20 20749 PLA                
Code
# A working version: shuffle the 20 sampled patients, then assign the first
# 10 rows to the active treatment. Repeating 1000 times, each patient should
# receive "TRT" close to 500 times (i.e., with probability 1/2).
trtCounts <- setNames(rep(0L, covid_treatmentSampleSize),
                      covid_treatmentSamplePatients$id)

for (i in 1:1000) {
    shuffled <- covid_treatmentSamplePatients |> 
        slice_sample(n = covid_treatmentSampleSize) |> 
        mutate(treatmentAssignment = if_else(row_number() <= covid_activeTreatmentSampleSize, "TRT", "PLA"))

    trtIds <- as.character(shuffled$id[shuffled$treatmentAssignment == "TRT"])
    trtCounts[trtIds] <- trtCounts[trtIds] + 1
}

trtCounts

Exercise 2.11

Note: Exercise Instructions

A chemist has seven light objects to weigh on a balance pan scale. The standard deviation of each weighing is denoted by \(\sigma\). In a 1935 paper, Frank Yates suggested an improved technique by weighing all seven objects together, and also weighing them in groups of three. The groups are chosen so that each object is weighed four times altogether, twice with any other object and twice without it.

Let \(y_1,...,y_8\) be the readings from the scale so that the equations for determining the unknown weights, \(\beta_1,...,\beta_7\), are

\[ \begin{align*} y_1 &= \beta_1 + \beta_2 + \beta_3 + \beta_4 + \beta_5 + \beta_6 + \beta_7 + \epsilon_1 \\ y_2 &= \beta_1 + \beta_2 + \beta_3 + \epsilon_2 \\ y_3 &= \beta_1 + \beta_4 + \beta_5 + \epsilon_3 \\ y_4 &= \beta_1 + \beta_6 + \beta_7 + \epsilon_4 \\ y_5 &= \beta_2 + \beta_4 + \beta_6 + \epsilon_5 \\ y_6 &= \beta_2 + \beta_5 + \beta_7 + \epsilon_6 \\ y_7 &= \beta_3 + \beta_4 + \beta_7 + \epsilon_7 \\ y_8 &= \beta_3 + \beta_5 + \beta_6 + \epsilon_8 \end{align*} \]

where the \(\epsilon_i,i=1,...,8\) are independent errors.

Hotelling suggested modifying Yates’ procedure by placing in the other pan of the scale those of the objects not included in one of his weighings. In other words, if the first three objects are to be weighed, then the remaining four objects would be placed in the opposite pan.

(a) Yates’ Procedure

Note: Exercise Instructions

Write Yates’ procedure in matrix form and find the least squares estimates of β.

Yates’ procedure can be written using the matrix form:

\[ y = X\beta + \epsilon \] \[ \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \\ \beta_7 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \\ \epsilon_7 \\ \epsilon_8 \end{pmatrix} \]

Code
Yates <- matrix(
    c(
    1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 0, 0, 0, 0,
    1, 0, 0, 1, 1, 0, 0,
    1, 0, 0, 0, 0, 1, 1,
    0, 1, 0, 1, 0, 1, 0,
    0, 1, 0, 0, 1, 0, 1,
    0, 0, 1, 1, 0, 0, 1,
    0, 0, 1, 0, 1, 1, 0
  ),
  nrow = 8,
  ncol = 7,
  byrow = TRUE
)

# X
X <- Yates
X
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    1    1    1    1    1    1
[2,]    1    1    1    0    0    0    0
[3,]    1    0    0    1    1    0    0
[4,]    1    0    0    0    0    1    1
[5,]    0    1    0    1    0    1    0
[6,]    0    1    0    0    1    0    1
[7,]    0    0    1    1    0    0    1
[8,]    0    0    1    0    1    1    0
Code
# X'X
Y <- t(X) %*% X
Y
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    4    2    2    2    2    2    2
[2,]    2    4    2    2    2    2    2
[3,]    2    2    4    2    2    2    2
[4,]    2    2    2    4    2    2    2
[5,]    2    2    2    2    4    2    2
[6,]    2    2    2    2    2    4    2
[7,]    2    2    2    2    2    2    4
Code
# (X'X)^-1
W <- solve(Y)
W
        [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]
[1,]  0.4375 -0.0625 -0.0625 -0.0625 -0.0625 -0.0625 -0.0625
[2,] -0.0625  0.4375 -0.0625 -0.0625 -0.0625 -0.0625 -0.0625
[3,] -0.0625 -0.0625  0.4375 -0.0625 -0.0625 -0.0625 -0.0625
[4,] -0.0625 -0.0625 -0.0625  0.4375 -0.0625 -0.0625 -0.0625
[5,] -0.0625 -0.0625 -0.0625 -0.0625  0.4375 -0.0625 -0.0625
[6,] -0.0625 -0.0625 -0.0625 -0.0625 -0.0625  0.4375 -0.0625
[7,] -0.0625 -0.0625 -0.0625 -0.0625 -0.0625 -0.0625  0.4375
Code
# (X'X)^-1 * X'
A <- W %*% t(X)
A
       [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]
[1,] 0.0625  0.3125  0.3125  0.3125 -0.1875 -0.1875 -0.1875 -0.1875
[2,] 0.0625  0.3125 -0.1875 -0.1875  0.3125  0.3125 -0.1875 -0.1875
[3,] 0.0625  0.3125 -0.1875 -0.1875 -0.1875 -0.1875  0.3125  0.3125
[4,] 0.0625 -0.1875  0.3125 -0.1875  0.3125 -0.1875  0.3125 -0.1875
[5,] 0.0625 -0.1875  0.3125 -0.1875 -0.1875  0.3125 -0.1875  0.3125
[6,] 0.0625 -0.1875 -0.1875  0.3125  0.3125 -0.1875 -0.1875  0.3125
[7,] 0.0625 -0.1875 -0.1875  0.3125 -0.1875  0.3125  0.3125 -0.1875

Let \(X\) be the design matrix, where each row \(i\) corresponds to an individual instance of weighing, and where each column \(j\) represents a “light object”.

\[ X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 0 \end{pmatrix} \]

Then each entry in the design matrix is defined as

\[ x_{ij} = \begin{cases} 1 &\text{object } j \text{ is on the scale in weighing } i \\ 0 &\text{object } j \text{ is not included in weighing } i \end{cases} \]

Then, it follows that a column of \(X\) is the participation pattern of one object across all weighings. For example Object 1 is in Weighing 1, 2, 3, and 4.

And so the normal matrix is:

\[ X^{T}X = \begin{pmatrix} 4 & 2 & 2 & 2 & 2 & 2 & 2 \\ 2 & 4 & 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 4 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 4 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 4 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 & 4 & 2 \\ 2 & 2 & 2 & 2 & 2 & 2 & 4 \end{pmatrix} \]

Each row \(j\) and each column \(k\) of this matrix represent an object, and the \((j,k)\) entry is the number of weighings that include both objects:

\[ (X^{T}X)_{jk} = \sum_{i=1}^{8}x_{ij}x_{ik} \]

When \(j = k\), the object has been in 4 weighings; this is true for every object. When \(j \neq k\), the two objects appear together in exactly 2 weighings; this is true for every pair of objects.
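We can confirm both observations directly (a standalone sanity check; `X` repeats the design matrix defined above):

```r
# Yates' design matrix: x_ij = 1 if object j is on the scale in weighing i
X <- matrix(c(
    1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 0, 0, 0, 0,
    1, 0, 0, 1, 1, 0, 0,
    1, 0, 0, 0, 0, 1, 1,
    0, 1, 0, 1, 0, 1, 0,
    0, 1, 0, 0, 1, 0, 1,
    0, 0, 1, 1, 0, 0, 1,
    0, 0, 1, 0, 1, 1, 0
), nrow = 8, byrow = TRUE)

XtX <- t(X) %*% X
all(diag(XtX) == 4)            # every object appears in exactly 4 weighings
#> [1] TRUE
all(XtX[upper.tri(XtX)] == 2)  # every pair appears together exactly twice
#> [1] TRUE
```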

Therefore, our interpretations are that:

  • the diagonal represents how well each object is measured individually, and
  • the off-diagonal represents how entangled pairs of objects are

We continue with the inverse normal matrix, which helps us better understand the uncertainty of the design.

\[ (X^\top X)^{-1} = \frac{1}{16} \begin{pmatrix} 7 & -1 & -1 & -1 & -1 & -1 & -1 \\ -1 & 7 & -1 & -1 & -1 & -1 & -1 \\ -1 & -1 & 7 & -1 & -1 & -1 & -1 \\ -1 & -1 & -1 & 7 & -1 & -1 & -1 \\ -1 & -1 & -1 & -1 & 7 & -1 & -1 \\ -1 & -1 & -1 & -1 & -1 & 7 & -1 \\ -1 & -1 & -1 & -1 & -1 & -1 & 7 \end{pmatrix} \]

As expected, we see symmetry and our interpretations are:

  • the diagonal represents the relative variance of each individual object’s estimate; the higher the value, the harder the weight is to estimate
    • “How much estimation uncertainty object j gets per unit of measurement noise”
  • the off-diagonal represents the entanglement between objects, i.e., how uncertainty is shared between pairs of estimates

\[ \text{Cov}(\hat{\beta}_j,\hat{\beta}_k) = -\frac{\sigma^2}{16} \]

We then proceed to the least-squares operator

\[ (X^\top X)^{-1}X^\top = \frac{1}{16} \begin{pmatrix} 1 & 5 & 5 & 5 & -3 & -3 & -3 & -3 \\ 1 & 5 & -3 & -3 & 5 & 5 & -3 & -3 \\ 1 & 5 & -3 & -3 & -3 & -3 & 5 & 5 \\ 1 & -3 & 5 & -3 & 5 & -3 & 5 & -3 \\ 1 & -3 & 5 & -3 & -3 & 5 & -3 & 5 \\ 1 & -3 & -3 & 5 & 5 & -3 & -3 & 5 \\ 1 & -3 & -3 & 5 & -3 & 5 & 5 & -3 \end{pmatrix} \]

Which, when multiplied by \(\underset{n \times 1}{y}\) gives us our least squares estimators:

\[ \hat{\beta}_1 = \frac{1}{16} \bigl( y_1 + 5y_2 + 5y_3 + 5y_4 - 3y_5 - 3y_6 - 3y_7 - 3y_8 \bigr) \] \[ \hat{\beta}_2 = \frac{1}{16} \bigl( y_1 + 5y_2 - 3y_3 - 3y_4 + 5y_5 + 5y_6 - 3y_7 - 3y_8 \bigr) \]

\[ \hat{\beta}_3 = \frac{1}{16} \bigl( y_1 + 5y_2 - 3y_3 - 3y_4 - 3y_5 - 3y_6 + 5y_7 + 5y_8 \bigr) \]

\[ \hat{\beta}_4 = \frac{1}{16} \bigl( y_1 - 3y_2 + 5y_3 - 3y_4 + 5y_5 - 3y_6 + 5y_7 - 3y_8 \bigr) \]

\[ \hat{\beta}_5 = \frac{1}{16} \bigl( y_1 - 3y_2 + 5y_3 - 3y_4 - 3y_5 + 5y_6 - 3y_7 + 5y_8 \bigr) \]

\[ \hat{\beta}_6 = \frac{1}{16} \bigl( y_1 - 3y_2 - 3y_3 + 5y_4 + 5y_5 - 3y_6 - 3y_7 + 5y_8 \bigr) \]

\[ \hat{\beta}_7 = \frac{1}{16} \bigl( y_1 - 3y_2 - 3y_3 + 5y_4 - 3y_5 + 5y_6 + 5y_7 - 3y_8 \bigr) \]
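These estimators can be checked for unbiasedness (a sketch; `X` repeats the design matrix above): applying the least-squares operator to \(X\beta\) must return \(\beta\), i.e., \((X^\top X)^{-1}X^\top X = I\).

```r
# Yates' design matrix, as above
X <- matrix(c(
    1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 0, 0, 0, 0,
    1, 0, 0, 1, 1, 0, 0,
    1, 0, 0, 0, 0, 1, 1,
    0, 1, 0, 1, 0, 1, 0,
    0, 1, 0, 0, 1, 0, 1,
    0, 0, 1, 1, 0, 0, 1,
    0, 0, 1, 0, 1, 1, 0
), nrow = 8, byrow = TRUE)

A <- solve(t(X) %*% X) %*% t(X)  # least-squares operator (X'X)^-1 X'
max(abs(A %*% X - diag(7)))      # should be ~0 (floating-point noise only)
```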

(b) Hotelling’s Procedure

Note: Exercise Instructions

Write Hotelling’s procedure in matrix form.

Hotelling’s procedure can be written using the matrix form:

\[ y = X\beta + \epsilon \] \[ \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \\ \beta_7 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \\ \epsilon_7 \\ \epsilon_8 \end{pmatrix} \]

Let \(Z\) be the design matrix for Hotelling’s procedure, where each row \(i\) corresponds to an individual instance of weighing, and where each column \(j\) represents a “light object”.

\[ Z = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{pmatrix} \]

Then each entry in the design matrix is defined as

\[ z_{ij} = \begin{cases} 1 &\text{object } j \text{ is on the left pan of the scale in weighing } i \\ -1 &\text{object } j \text{ is on the right pan of the scale in weighing } i \\ \end{cases} \]

(c) Variance Comparison

Note: Exercise Instructions

Find the variance of a weight using Yates’ and Hotelling’s procedures.

As we already have \[ y = X\beta + \epsilon, \qquad \mathbb{E}[\epsilon]=0, \qquad \text{Var}(\epsilon)=\sigma^2I, \]

and we’ve already computed the least squares estimator

\[ \hat{\beta} = (X^{T}X)^{-1}X^{T}y \]

and we’ve already computed the variance-covariance matrix

\[ \text{Var}(\hat{\beta})=\sigma^2(X^{T}X)^{-1} \]

Thus the diagonal entries of the variance-covariance matrix are the variances \(\text{Var}(\hat{\beta}_j)\).

For Yates, we have already computed this:

\[ \text{Var}(\hat{\beta_j}) = \sigma^2\frac{7}{16} \quad \text{for all } j = 1,...,7 \]

For Hotelling, we will follow the same steps as we did previously (i.e., design matrix → normal matrix → inverse normal matrix):

Code
Hotelling <- matrix(
    c(
    1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, -1, -1, -1, -1,
    1, -1, -1, 1, 1, -1, -1,
    1, -1, -1, -1, -1, 1, 1,
    -1, 1, -1, 1, -1, 1, -1,
    -1, 1, -1, -1, 1, -1, 1,
    -1, -1, 1, 1, -1, -1, 1,
    -1, -1, 1, -1, 1, 1, -1
  ),
  nrow = 8,
  ncol = 7,
  byrow = TRUE
)

# Z (i.e., Design Matrix)
Z <- Hotelling
Z
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    1    1    1    1    1    1
[2,]    1    1    1   -1   -1   -1   -1
[3,]    1   -1   -1    1    1   -1   -1
[4,]    1   -1   -1   -1   -1    1    1
[5,]   -1    1   -1    1   -1    1   -1
[6,]   -1    1   -1   -1    1   -1    1
[7,]   -1   -1    1    1   -1   -1    1
[8,]   -1   -1    1   -1    1    1   -1
Code
# Z'Z (i.e., ZY; Normal Matrix)
ZY <- t(Z) %*% Z
ZY
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    8    0    0    0    0    0    0
[2,]    0    8    0    0    0    0    0
[3,]    0    0    8    0    0    0    0
[4,]    0    0    0    8    0    0    0
[5,]    0    0    0    0    8    0    0
[6,]    0    0    0    0    0    8    0
[7,]    0    0    0    0    0    0    8
Code
# (Z'Z)^-1 (i.e., ZW; Inverse Normal Matrix)
ZW <- solve(ZY)
ZW
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,] 0.125 0.000 0.000 0.000 0.000 0.000 0.000
[2,] 0.000 0.125 0.000 0.000 0.000 0.000 0.000
[3,] 0.000 0.000 0.125 0.000 0.000 0.000 0.000
[4,] 0.000 0.000 0.000 0.125 0.000 0.000 0.000
[5,] 0.000 0.000 0.000 0.000 0.125 0.000 0.000
[6,] 0.000 0.000 0.000 0.000 0.000 0.125 0.000
[7,] 0.000 0.000 0.000 0.000 0.000 0.000 0.125
Code
# (Z'Z)^-1 * Z' (i.e., ZA; Least-Squares Operator)
ZA <- ZW %*% t(Z)
ZA
      [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]
[1,] 0.125  0.125  0.125  0.125 -0.125 -0.125 -0.125 -0.125
[2,] 0.125  0.125 -0.125 -0.125  0.125  0.125 -0.125 -0.125
[3,] 0.125  0.125 -0.125 -0.125 -0.125 -0.125  0.125  0.125
[4,] 0.125 -0.125  0.125 -0.125  0.125 -0.125  0.125 -0.125
[5,] 0.125 -0.125  0.125 -0.125 -0.125  0.125 -0.125  0.125
[6,] 0.125 -0.125 -0.125  0.125  0.125 -0.125 -0.125  0.125
[7,] 0.125 -0.125 -0.125  0.125 -0.125  0.125  0.125 -0.125

\[ \begin{align} Z &= \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & -1 & 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & -1 & -1 & 1 & 1 \\ -1 & 1 & -1 & 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ -1 & -1 & 1 & -1 & 1 & 1 & -1 \end{pmatrix} \\ Z^{T}Z &= \begin{pmatrix} 8 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 8 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 8 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 8 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 8 \end{pmatrix} \\ (Z^\top Z)^{-1} &= \frac{1}{8} \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \end{align} \]

Therefore

\[ \text{Var}(\hat{\beta_j})=\frac{\sigma^2}{8} \quad \text{for all } j = 1,...,7 \]

(d) Recommendation

Note: Exercise Instructions

If the chemist wanted estimates of the weights with the highest precision, then which procedure (Yates or Hotelling) would you recommend that the chemist use to weigh objects? Explain your reasoning.

To keep the estimates of the objects as precise as possible, we want the procedure with the lowest variance, which would be Hotelling’s.
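The size of the improvement can be quantified with a small computation using the variances derived above:

```r
yatesVar     <- 7 / 16   # Var(beta_hat_j) / sigma^2 under Yates
hotellingVar <- 1 / 8    # Var(beta_hat_j) / sigma^2 under Hotelling

yatesVar / hotellingVar  # Yates' variance is 3.5 times larger
#> [1] 3.5
```

In other words, Hotelling’s two-pan scheme reduces each estimate’s variance by a factor of 3.5 using the same eight weighings; since \((Z^\top Z)_{jj} \le 8\) for any design with \(\pm 1\) entries, \(\sigma^2/8\) is the best achievable here.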

Exercise 2.12

Note: Exercise Instructions

Does life satisfaction change by region over time? Use the lifesat_childmort data from Example 2.15 to explore this question.

Code
audit_raw <- lifesat_childmort |>
  summarise(
    n_rows = n(),
    n_lifesat = sum(!is.na(LifeSatisfaction)),
    pct_lifesat_missing = mean(is.na(LifeSatisfaction)) * 100
  )

audit_raw
# A tibble: 1 × 3
  n_rows n_lifesat pct_lifesat_missing
   <int>     <int>               <dbl>
1  22595      1704                92.5
Code
cleanData_lifesat_childmort <- lifesat_childmort |>
  filter(!is.na(LifeSatisfaction))

cleanData2_lifesat_childmort <- cleanData_lifesat_childmort |>
  mutate(
    Continent2 = suppressWarnings(
      countrycode(`Country Code`, origin = "iso3c", destination = "continent")
    ),
    Continent2 = factor(Continent2),
    Year_f = factor(Year)
  ) |>
  filter(!is.na(Continent2)) |>
  select(Continent2, Year, Year_f, LifeSatisfaction)

audit_clean <- cleanData2_lifesat_childmort |>
  summarise(
    n_rows = n(),
    n_years = n_distinct(Year),
    n_continents = n_distinct(Continent2)
  )

audit_clean
# A tibble: 1 × 3
  n_rows n_years n_continents
   <int>   <int>        <int>
1   1686      14            5
Code
cleanData2_lifesat_childmort |>
  ggplot(aes(LifeSatisfaction)) +
  geom_density() +
  geom_vline(xintercept = mean(cleanData2_lifesat_childmort$LifeSatisfaction), linetype = "dashed") +
  labs(title = "Distribution of Life Satisfaction", x = "Life Satisfaction", y = "Density")
Figure 1: Distribution of Life Satisfaction
Code
cleanData2_lifesat_childmort |>
  ggplot(aes(x = Year, y = LifeSatisfaction, colour = Continent2)) +
  geom_jitter(width = 0.2, alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Life Satisfaction Over Time by Continent",
    x = "Year",
    y = "Life Satisfaction",
    colour = "Continent"
  )
`geom_smooth()` using formula = 'y ~ x'
Figure 2: Life Satisfaction Over Time by Continent
Code
m1 <- lm(LifeSatisfaction ~ Year + Continent2, data = cleanData2_lifesat_childmort)
m2 <- lm(LifeSatisfaction ~ Year * Continent2, data = cleanData2_lifesat_childmort)

summary(m1)

Call:
lm(formula = LifeSatisfaction ~ Year + Continent2, data = cleanData2_lifesat_childmort)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.79295 -0.58791 -0.03457  0.59240  2.16846 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -8.356093  11.214322  -0.745    0.456    
Year                0.006307   0.005572   1.132    0.258    
Continent2Americas  1.785698   0.064768  27.571   <2e-16 ***
Continent2Asia      0.936668   0.055520  16.871   <2e-16 ***
Continent2Europe    1.782287   0.057309  31.100   <2e-16 ***
Continent2Oceania   2.975140   0.176759  16.832   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8421 on 1680 degrees of freedom
Multiple R-squared:  0.4425,    Adjusted R-squared:  0.4409 
F-statistic: 266.7 on 5 and 1680 DF,  p-value: < 2.2e-16
Code
# Compare whether continent-specific time trends improve fit
anova(m1, m2)
summary(m2)
Analysis of Variance Table

Model 1: LifeSatisfaction ~ Year + Continent2
Model 2: LifeSatisfaction ~ Year * Continent2
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   1680 1191.4                           
2   1676 1190.6  4   0.75125 0.2644 0.9009

Call:
lm(formula = LifeSatisfaction ~ Year * Continent2, data = cleanData2_lifesat_childmort)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.78402 -0.59173 -0.03319  0.60420  2.16141 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -10.453258  22.821658  -0.458    0.647
Year                      0.007349   0.011339   0.648    0.517
Continent2Americas        8.295588  35.084814   0.236    0.813
Continent2Asia           15.314280  30.687631   0.499    0.618
Continent2Europe        -12.568199  31.532121  -0.399    0.690
Continent2Oceania        32.253549  92.500633   0.349    0.727
Year:Continent2Americas  -0.003235   0.017436  -0.186    0.853
Year:Continent2Asia      -0.007145   0.015249  -0.469    0.639
Year:Continent2Europe     0.007131   0.015668   0.455    0.649
Year:Continent2Oceania   -0.014550   0.045969  -0.317    0.752

Residual standard error: 0.8428 on 1676 degrees of freedom
Multiple R-squared:  0.4429,    Adjusted R-squared:  0.4399 
F-statistic:   148 on 9 and 1676 DF,  p-value: < 2.2e-16

Interpretation

In a multiple linear regression allowing continent-specific time trends, none of the interaction terms between Year and Continent were statistically significant. This suggests that while average levels of life satisfaction differ substantially across continents, the rate of change over time is broadly similar across regions. Overall, life satisfaction appears relatively stable over the observed period.

Appendix: Unused Exploratory Code

The following is best to ignore: it is code I ran but did not end up using in the final project, and have not yet moved elsewhere.

Code
skim(lifesat_childmort)
Data summary
Name lifesat_childmort
Number of rows 22595
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country 0 1.00 4 50 0 332 0
Country Code 1318 0.94 3 8 0 286 0
Continent 22310 0.01 4 13 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1.00 1960.63 51.06 1700.00 1946.00 1.974e+03 1997.00 2.018000e+03 ▁▁▁▂▇
Tot_pop 2514 0.89 20772028.22 84002122.43 0.00 511000.00 3.518e+06 11253665.00 1.359368e+09 ▇▁▁▁▁
LifeSatisfaction 20891 0.08 5.44 1.12 2.66 4.61 5.340e+00 6.27 8.020000e+00 ▁▆▇▅▃
Under5mort 11389 0.50 75.64 74.94 2.10 18.10 4.810e+01 110.80 4.419000e+02 ▇▂▁▁▁
Code
cleanData_lifesat_childmort <-lifesat_childmort |> 
    filter(!is.na(LifeSatisfaction))

skim(cleanData_lifesat_childmort)
Data summary
Name cleanData_lifesat_childmo…
Number of rows 1704
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country 0 1.00 4 28 0 164 0
Country Code 0 1.00 3 8 0 164 0
Continent 1561 0.08 4 13 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1.00 2012.33 3.69 2005.00 2009.00 2012.00 2015.00 2.018000e+03 ▅▆▆▇▇
Tot_pop 736 0.57 52336410.75 166399244.90 293124.00 5358910.00 12442635.50 37655194.50 1.359368e+09 ▇▁▁▁▁
LifeSatisfaction 0 1.00 5.44 1.12 2.66 4.61 5.34 6.27 8.020000e+00 ▁▆▇▅▃
Under5mort 178 0.90 33.69 36.93 2.10 6.60 17.80 50.27 2.109000e+02 ▇▂▁▁▁
Code
cleanData2_lifesat_childmort <- cleanData_lifesat_childmort |> 
    mutate(
        Continent2 = factor(suppressWarnings(
            countrycode(`Country Code`, origin = "iso3c", destination = "continent")
        )),
        Year_f = factor(Year)
    ) |> 
    filter(!is.na(Continent2)) |> 
    select(Continent2, Year, Year_f, LifeSatisfaction)
skim(cleanData2_lifesat_childmort)
Data summary

| | Value |
|---|---|
| Name | cleanData2_lifesat_childm… |
| Number of rows | 1686 |
| Number of columns | 4 |
| Column type frequency: factor | 2 |
| Column type frequency: numeric | 2 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Continent2 | 0 | 1 | FALSE | 5 | Asi: 512, Eur: 445, Afr: 420, Ame: 285 |
| Year_f | 0 | 1 | FALSE | 14 | 201: 146, 201: 145, 201: 143, 201: 141 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 2012.32 | 3.69 | 2005.00 | 2009.00 | 2012.00 | 2015.00 | 2018.00 | ▅▆▆▇▇ |
| LifeSatisfaction | 0 | 1 | 5.44 | 1.13 | 2.66 | 4.61 | 5.33 | 6.28 | 8.02 | ▂▆▇▆▃ |
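The cleaning step above derives `Continent2` from the ISO3 country codes with `countrycode()`. A minimal sketch of that lookup, using a few illustrative codes (not drawn from the dataset):

```r
library(countrycode)

# ISO3 codes map to continent names; unmatched codes (e.g. the OWID_*
# aggregate rows) come back as NA with a warning, which is why the
# cleaning step wraps the call in suppressWarnings()
suppressWarnings(
    countrycode(c("USA", "FRA", "JPN", "OWID_WRL"),
                origin = "iso3c", destination = "continent")
)
# "Americas" "Europe" "Asia" NA
```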
Code
cleanData2_lifesat_childmort |> 
    ggplot(aes(Year_f)) + 
        geom_bar()

Code
cleanData2_lifesat_childmort |> 
    ggplot(aes(Continent2)) + 
        geom_bar()

Code
cleanData2_lifesat_childmort |> 
    ggplot(aes(LifeSatisfaction)) + 
        geom_density() +
        # Mark the overall mean life satisfaction for reference
        geom_vline(xintercept = mean(cleanData2_lifesat_childmort$LifeSatisfaction))

Code
m1 <- lm(LifeSatisfaction ~ Year + Continent2,
            data = cleanData2_lifesat_childmort)
summary(m1)

Call:
lm(formula = LifeSatisfaction ~ Year + Continent2, data = cleanData2_lifesat_childmort)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.79295 -0.58791 -0.03457  0.59240  2.16846 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -8.356093  11.214322  -0.745    0.456    
Year                0.006307   0.005572   1.132    0.258    
Continent2Americas  1.785698   0.064768  27.571   <2e-16 ***
Continent2Asia      0.936668   0.055520  16.871   <2e-16 ***
Continent2Europe    1.782287   0.057309  31.100   <2e-16 ***
Continent2Oceania   2.975140   0.176759  16.832   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8421 on 1680 degrees of freedom
Multiple R-squared:  0.4425,    Adjusted R-squared:  0.4409 
F-statistic: 266.7 on 5 and 1680 DF,  p-value: < 2.2e-16
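As a quick check on how to read these coefficients, a fitted mean can be assembled by hand from the printout. For example, for a European country in 2010 (values copied from the `m1` summary above; Africa is the baseline continent):

```r
# Fitted mean life satisfaction for a European country in 2010,
# built from the m1 coefficient estimates printed above
b_int    <- -8.356093   # (Intercept)
b_year   <-  0.006307   # Year
b_europe <-  1.782287   # Continent2Europe
b_int + b_year * 2010 + b_europe
# ≈ 6.10
```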
Code
m2 <- lm(LifeSatisfaction ~ Year * Continent2,
            data = cleanData2_lifesat_childmort)
summary(m2)

Call:
lm(formula = LifeSatisfaction ~ Year * Continent2, data = cleanData2_lifesat_childmort)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.78402 -0.59173 -0.03319  0.60420  2.16141 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -10.453258  22.821658  -0.458    0.647
Year                      0.007349   0.011339   0.648    0.517
Continent2Americas        8.295588  35.084814   0.236    0.813
Continent2Asia           15.314280  30.687631   0.499    0.618
Continent2Europe        -12.568199  31.532121  -0.399    0.690
Continent2Oceania        32.253549  92.500633   0.349    0.727
Year:Continent2Americas  -0.003235   0.017436  -0.186    0.853
Year:Continent2Asia      -0.007145   0.015249  -0.469    0.639
Year:Continent2Europe     0.007131   0.015668   0.455    0.649
Year:Continent2Oceania   -0.014550   0.045969  -0.317    0.752

Residual standard error: 0.8428 on 1676 degrees of freedom
Multiple R-squared:  0.4429,    Adjusted R-squared:  0.4399 
F-statistic:   148 on 9 and 1676 DF,  p-value: < 2.2e-16
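None of the `Year:Continent2` interaction terms look meaningful. A nested-model F test (`anova(m1, m2)`) would confirm this directly; as a rough back-of-the-envelope version, the F statistic for the four added terms can be approximated from the (rounded) R² values printed above, so the result is approximate:

```r
# Approximate F test for the four Year:Continent2 interaction terms,
# using the rounded multiple R-squared values from the two summaries
r2_m1 <- 0.4425   # additive model,    residual df = 1680
r2_m2 <- 0.4429   # interaction model, residual df = 1676
F_stat <- ((r2_m2 - r2_m1) / 4) / ((1 - r2_m2) / 1676)
F_stat
# ≈ 0.30, nowhere near significant
```

With F this far below 1, the simpler additive model `m1` is preferred.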